Abstract

We study a class of aggregate-join queries with multiple aggregation operators evaluated over
annotated relations. We show that straightforward extensions of standard multiway join algorithms
and generalized hypertree decompositions (GHDs) provide best-known runtime guarantees. In
contrast, prior work uses bespoke algorithms and data structures and does not match these
guarantees. Our extensions to the standard techniques are a pair of simple tests that (1) determine if
two orderings of aggregation operators are equivalent and (2) determine if a GHD is compatible 
with a given ordering. These tests provide a means to find an optimal GHD that, when provided 
to standard join algorithms, will correctly answer a given aggregate-join query. The second class 
of our contributions is a pair of complete characterizations of (1) the set of orderings equivalent 
to a given ordering and (2) the set of GHDs compatible with some equivalent ordering. We 
show by example that previous approaches are incomplete. The key technical consequence of our 
characterizations is a decomposition of a compatible GHD into a set of (smaller) unconstrained 
GHDs, i.e. into a set of GHDs of sub-queries without aggregations. Since this decomposition is 
comprised of unconstrained GHDs, we are able to connect to the wide literature on GHDs for 
join query processing, thereby obtaining improved runtime bounds, MapReduce variants, and an 
efficient method to find approximately optimal GHDs.



1 Introduction


Generalized hypertree decompositions (GHDs), introduced by Gottlob et al. and further
developed by Grohe and Marx, provide a means for performing early projection in join
processing, which can result in dramatically faster runtimes. In this work, we extend GHDs
to handle queries that include aggregations, which allows us to capture both SQL-aggregate
processing and message passing problems. Motivated by our own database engine based on
GHDs, we seek to more deeply understand the space of optimization for aggregate-join
queries.

We build upon work by Green, Karvounarakis, and Tannen on annotated relations to
define our notion of aggregation. These annotations provide a general definition of aggregation,
allowing us to represent a wide-ranging set of problems as aggregate-join queries. Our queries,
which we call Ajar (Aggregations and Joins over Annotated Relations) queries, contain semiring
quantifiers that “sum over” or “marginalize out” values. We formally define Ajar queries in
Section 3, but they are easy to illustrate by example:


► Example 1. Consider two relations with attributes {A, B} and {B, C} such that each tuple is
annotated with some integer; we call these relations $\mathbb{Z}$-relations. Consider the query:

$$\sum_{C} \sum_{B} R(A,B) \bowtie S(B,C)$$

Our output will then be a $\mathbb{Z}$-relation with attribute set {A}. Each value a of attribute A in R is
associated with a set $X_a$ of pairs $(b, z_R)$ composed of a value b of attribute B and an annotation
$z_R$. Furthermore, for each b value in $X_a$, there is a set $X_b$ from relation S of pairs $(c, z_S)$ composed
of a value c of attribute C and an annotation $z_S$. Given $X_a$ and each $X_b$ associated with a given
value a, the annotation associated with a in our output is

$$\sum_{(b, z_R) \in X_a} \; \sum_{(c, z_S) \in X_b} z_R \cdot z_S$$

Figure 1: Illustrating the computation of Example 1.
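The computation in Example 1 can be sketched in a few lines of Python. This is only an illustration: the two toy relations, their values, and the function name `sum_out_b_and_c` are assumptions made up for the sketch, and each $\mathbb{Z}$-relation is modeled as a dictionary from tuples to integer annotations.

```python
from collections import defaultdict

# Hypothetical toy Z-relations: each maps a tuple to an integer annotation.
R = {("a1", "b1"): 2, ("a1", "b2"): 3, ("a2", "b1"): 5}   # R(A, B)
S = {("b1", "c1"): 7, ("b1", "c2"): 1, ("b2", "c1"): 4}   # S(B, C)


def sum_out_b_and_c(R, S):
    """For each a, sum z_R * z_S over all (b, z_R) in X_a and (c, z_S) in X_b."""
    out = defaultdict(int)
    for (a, b), z_r in R.items():
        for (b2, _c), z_s in S.items():
            if b == b2:                 # tuples join on the shared attribute B
                out[a] += z_r * z_s     # join multiplies, aggregation adds
    return dict(out)


result = sum_out_b_and_c(R, S)
print(result)
```

For a1 the sum is 2·(7+1) + 3·4 = 28, and for a2 it is 5·(7+1) = 40.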

Ajar queries capture both classical SQL-style queries and newer data processing problems like
probabilistic inference via message passing on graphical models. In fact, Aji and McEliece
proposed the “Marginalize a Product Function” (MPF) problem, which is a special case
of an Ajar query, and showed how the problem and its solution capture a number of classic
problems and algorithms, including fast Hadamard transforms, Viterbi’s algorithm, the
forward-backward algorithm, FFT, and probabilistic inference in Bayesian networks. These algorithms
are fundamental to various fields; for example, the forward-backward algorithm over conditional
random fields forms the basis for state-of-the-art solutions to named entity recognition,
part-of-speech tagging, noun phrase segmentation, and other problems in NLP. We are motivated
by the wide applicability of queries over annotated relations; annotated relations may provide a
framework for combining classical query processing, linear algebra, and statistical inference in a
single data processing system.

We consider a generalization of MPF with multiple aggregation operators. We represent an
aggregate-join query as a join Q and an aggregation ordering, which specifies both the ordering
and the aggregation of each attribute. Our language directly follows from the work of Abo
Khamis, Ngo, and Rudra, who investigated the “Functional Aggregate Query” (FAQ) problem.
In addition to MPF, FAQ is a generalization of Chen and Dalmau’s QCQ problem, in
which the only aggregates are logical quantifiers (AND and OR).

The key technical challenge in both problems is characterizing the permissible aggregation
orders to answer the query. Chen and Dalmau give a complete characterization of which variable
orders are permissible for QCQ via a procedure. We first give a simple (complete) procedure for
our more general class of queries with multiple aggregations, and then we provide a complete
characterization of permissible orders.

■ A Simple Test for Equivalence: A query can be thought of as a body Q and a string of
attribute-operator pairs $\alpha$. Given a query Q and two orders $\alpha$ and $\beta$, we provide a simple
test to determine whether $\alpha$ and $\beta$ are equivalent (i.e., return the same output for any input
database). The technical challenge is that different aggregation operators (e.g., $\Sigma$ and max)
cannot freely commute. We show that attribute-operator pairs can commute for only two
reasons: (1) their operators commute or (2) their attributes are “independent” in the query.
For example, in the query $\sum_{C} \max_{A} R(A,B), S(B,C)$ the aggregations involving A and C can
commute: even though max and $\Sigma$ do not commute as operators, the query body renders
them independent given B. We show that these two conditions are complete, which leads to
a simple test for equivalence (Algorithm 2).

■ A Simple Test for GHD and Order Compatibility: We say a GHD is compatible with an
ordering if we can run standard join algorithms on the GHD while performing aggregations in
the order given by the ordering. We show that testing for compatibility amounts to verifying
that for any two attributes A, B, if the topmost GHD node containing A occurs above the
topmost node containing B, then A occurs before B in the ordering.


This pair of results gives us a simple algorithm that achieves the best known runtime results.
Given a query $(Q, \alpha)$, enumerate each order $\beta$ and each GHD G, checking if $\alpha$ is equivalent to
$\beta$ and G is compatible with $\beta$. If so, record the cost of solving the query using G, according to
(say) fractional hypertree width. Solve the query using the lowest-cost $(G, \beta)$ with a standard join
algorithm.

The preceding simple algorithm runs in time exponential in the query size. But finding the
optimal GHD even without aggregation is an NP-hard problem, so the brute force optimizer
has essentially optimal runtime. It is easy to implement, and a variant is in our prototype
database.


The more interesting problem is to characterize the notions of equivalence, mirroring Chen
and Dalmau. To that end, we give two new, complete characterizations:


■ A Complete Characterization of Equivalent Orders: Given an order $\alpha$ and two
attribute-operator pairs $x, y \in \alpha$, we describe a set of constraints of the form “in any order, x must
appear after y.” Our constraints are sound and complete, i.e., a string $\beta$ satisfies these
constraints if and only if it is equivalent to $\alpha$. In contrast, previous approaches have an
incomplete characterization, as shown by an example in the Appendix.

■ A Complete Characterization of GHDs Compatible with Any Equivalent Order: Given an order
$\alpha$ and a query hypergraph Q, we call a GHD ‘valid’ if it is compatible with some ordering
equivalent to $\alpha$. We give a succinct characterization for all valid GHDs. We then describe a
decomposition of the query $(Q, \alpha)$ into a series of characteristic hypergraphs (without attached
aggregation orderings). GHDs for these hypergraphs can be combined into a valid GHD for
the original query. We show that for any “node-monotone” width function, there is a
GHD with optimal width w that can be constructed with this decomposition. Treewidth,
fractional hypertree width, and submodular width are all node-monotone.

Conceptually, we think the latter result is especially important for tying our work to existing
GHD literature; the result reduces our problem to operating on standard GHDs. Pragmatically, 
we can apply existing GHD results to our characteristic hypergraphs and obtain the following 
results for free: 


■ Based on Grohe and Marx, we are able to describe our runtime in terms of classical
metrics like fractional hypertree width. In turn, we can use standard notions to upper bound
the runtime, like fractional hypertree width, Marx’s submodular width, or Joglekar’s
efficiently computable variant.

■ Based on Afrati et al., who bound the communication costs of join processing in terms of
a “width” parameter for GHDs, we can develop efficient MapReduce algorithms for solving
Ajar queries.


■ Based on Marx’s approximation for GHDs, we can find approximately optimal GHDs for
the popular fractional hypertree width measure in polynomial time.

¹ Two technical notes: (1) methods like submodular width [17] or Joglekar and Ré [13] require that we
first partition the instances and then run the above algorithm; (2) FAQ [15] is not output sensitive (it
does not use GHDs), and so it handles output attributes less efficiently than the above algorithm, as
seen in Example 58.

² Informally, a map is node-monotone if adding more nodes to a graph does not reduce the measure, but
additional edges may reduce the measure; see Definition 28.

³ In contrast, FAQ’s decomposition strategy may miss the optimal GHD. Appendix Example 33 shows a
case in which using the FAQ decomposition gives a width 2n while AJAR obtains width n for n > 1.
We also exhibit a family of queries and instances on which AJAR runs asymptotically faster than
FAQ for n > 1; see Appendix Example 60.

We get the above results essentially for free from forging this connection to GHDs. We view 
this simple link as a strength of our approach. 

Finally, we discuss an extension to handle “product aggregations” that allows us to aggregate 
away an attribute before we join the relations containing the attribute when the aggregation 
operator is the multiplication operator of the semiring. FAQ was the first to observe that this
special case can improve certain types of logical queries. This opens up a new space of equivalent 
orderings and valid GHDs; mirroring the above results, we give a simple test and a complete 
characterization of the valid GHDs for queries that include this aggregation. As a result, we 
obtain similar improvements in runtime relative to previous work. 

Outline. We discuss related work in Section 2. In Section 3, we introduce notation and
algorithms that are relevant to our work before defining the Ajar problem and discussing its
solution, which involves running existing algorithms on a restricted class of GHDs. Section 4
provides a succinct characterization of all orderings that are equivalent to a given ordering.
Section 5 discusses how to connect our work with recent research on GHDs, explaining how
to construct valid optimal query plans and how to further improve and parallelize our results. In
Section 6, we discuss how to incorporate product aggregations.


2 Related Work

Join Algorithms. The Yannakakis algorithm, introduced in 1981, guarantees a runtime of
$O(\mathrm{IN} + \mathrm{OUT})$ for $\alpha$-acyclic join queries. Modern multiway algorithms can process any join
query and have worst-case optimal runtime. In particular, Atserias, Grohe, and Marx derived
a tight bound on the worst-case size of a join query given the input size and structure. Ngo
et al. presented the first algorithm to achieve this runtime bound, i.e. the first worst-case
optimal algorithm. Soon after, Veldhuizen presented Leapfrog Triejoin, a very simple worst-case
optimal algorithm that had been implemented in LogicBlox’s commercial database system [27].
Ngo et al. later presented the simplified and unified algorithm GenericJoin (GJ) that captured
both of the previous worst-case optimal algorithms.

GHDs. First introduced by Gottlob, Leone, and Scarcello, hypertree decompositions and
the associated hypertree width generalize the concept of tree decompositions. Conceptually,
the decompositions capture a hypergraph’s cyclicity, allowing them to facilitate the selective
use of GJ and Yannakakis in the standard hybrid join algorithm GHDJoin. There are deep
connections between variable orderings and GHDs, which we leverage extensively. Grohe
and Marx introduced the idea of fractional hypertree width over GHDs, which bounds the
runtime of GHDJoin by $\tilde{O}(\mathrm{IN}^{w} + \mathrm{OUT})$ ($\tilde{O}$ hides poly-logarithmic factors) for w defined to be
the minimum fractional hypertree width among all GHDs.

Semirings and Aggregations. Green, Karvounarakis, and Tannen developed the idea of
annotations over a semiring. Our notation for the annotations is superficially different from
theirs, solely for notational convenience. We delve into more detail in Section 3. This also has
been used as a mechanism to capture aggregation in probabilistic databases [23].

MPF. Aji and McEliece defined the “Marginalize a Product Function” (MPF) problem,
which is equivalent to the space of Ajar queries with only one aggregation operator. They
showed that MPF generalizes a wide variety of important algorithms and problems, which also
implies that Ajar queries are remarkably general. They also provided a message passing
algorithm to solve MPF, which has since been refined. We provide runtime guarantees that
improve the current state of the art.

Aggregate-Join Queries. There is a standard modification to Yannakakis to handle
aggregations, but the classic analysis provides only an $O(\mathrm{IN} \cdot \mathrm{OUT})$ bound. Bakibayev,
Kocisky, Olteanu, and Zavodny study aggregation-join queries in factorized databases, and later
Olteanu and Zavodny connected factorized databases and GHDs/GHDJoin. They develop
the intuition that if output attributes are above non-output attributes, the +OUT runtime is
preserved; we use the same intuition to develop and analyze AggroGHDJoin, a variant of GHDJoin
for aggregate-join queries.

Abo Khamis, Ngo, and Rudra present the “Functional Aggregate Query” (FAQ) problem [15],
which is equivalent to Ajar. The FAQ/Ajar problems arose out of discussions between Ngo,
Rudra, and Ré at PODS 2012 about how to extend the worst-case result to queries using aggregation
and message passing via Green et al.’s semiring formulation. We originally worked jointly on the
problem, but we developed substantially different approaches. As a result, we split our work.
We argue that the Ajar approach is simpler, as it yields the best known runtime results in
only a few simple statements in Section 3. We also describe new complete characterizations as
described above. Pragmatically, these completeness results allow us to connect more easily
to existing literature. We have already implemented the algorithm described here in the related
database engine EmptyHeaded. This engine has run motif finding, PageRank, and single-source
shortest path queries dramatically faster than previous high-level approaches that take
Datalog-like queries as input.

A primary application of multiple aggregation operators is quantified conjunctive queries
(QCQ) and the counting variant, which can be expressed as Ajar queries over the semiring
$(\lor, \land)$ with aggregations involving both operators. Here, we follow FAQ’s idea to formulate this
as a query with product aggregation. Chen and Dalmau completely characterized the space of
tractable QCQ by defining a notion of width that relies on variable orderings. Chen and Dalmau’s
width definition includes a complete characterization of the permissible variable orderings for a
QCQ instance. Their characterization is similar in spirit to the partial ordering we define
later in the paper, which characterizes the space of valid GHDs for an Ajar query. However, their
results are focused on tractability rather than the optimal runtime exponents; our characterization
extends theirs and has improved runtime bounds.


3 AJAR and A Simple Solution

We start by describing some background material needed to define the Ajar problem. After
that, we formally define the Ajar problem and our solution to it.

3.1 Background 

We use the classic hypergraph representation for database schema and queries. A hypergraph
$\mathcal{H}$ is a pair $(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a non-empty set of vertices and $\mathcal{E} \subseteq 2^{\mathcal{V}}$ is a set of
hyperedges. Each $A \in \mathcal{V}$ is called an attribute. Each attribute has a corresponding domain $\mathcal{D}^A$.

■ Data: For each hyperedge $F \in \mathcal{E}$, there is a corresponding relation $R_F \subseteq \prod_{A \in F} \mathcal{D}^A$; we use
the notation $\mathcal{D}^F$ to denote the domain of the tuples, $\prod_{A \in F} \mathcal{D}^A$.
■ Join Query: Given a set $\mathcal{E}$ and a relation $R_F$ for each $F \in \mathcal{E}$, let $\mathcal{V} = \bigcup_{F \in \mathcal{E}} F$. The join
query is written $\bowtie_{F \in \mathcal{E}} R_F$ and is defined as

$$\{ t \in \mathcal{D}^{\mathcal{V}} \mid \forall F \in \mathcal{E} : \pi_F(t) \in R_F \}$$

We use n to denote the number of attributes $|\mathcal{V}|$ and m to denote the number of relations $|\mathcal{E}|$.
IN denotes the sum of sizes of input relations in a query, and OUT denotes the output size.


⁴ We have been told that LogicBlox has implemented a similar algorithm recently, but their approach is
not public. We shared our implementation with them several months ago.

Algorithm 1 Yannakakis($T = (V, E)$, $\{R_F \mid F \in V\}$)
Input: Join tree $T = (V, E)$, relations $R_F$ for each $F \in V$
1: for all $F \in V$ in some bottom-up order do
2:     $P \leftarrow$ parent of F
3:     $R_P \leftarrow R_P \ltimes R_F$
4: end for
5: for all $F \in V$ in some top-down order do
6:     $P \leftarrow$ parent of F
7:     $R_F \leftarrow R_F \ltimes R_P$
8: end for
9: for all $F \in V$ in some bottom-up order do
10:     $P \leftarrow$ parent of F
11:     $R_P \leftarrow R_P \bowtie R_F$
12: end for
13: return $R_R$ for the root R
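As an illustration of the three passes of Algorithm 1, here is a minimal Python sketch over plain (unannotated) relations. It assumes a representation that is not in the paper: each relation is a set of tuples with a separate schema tuple, and `parent` is a dictionary that lists every parent before its children. The helper names are hypothetical, and the nested-loop `join` is chosen for brevity, so the sketch does not attain the $O(\mathrm{IN} + \mathrm{OUT})$ bound.

```python
def _key(schema, t, attrs):
    """Project tuple t (laid out by schema) onto attrs."""
    row = dict(zip(schema, t))
    return tuple(row[a] for a in attrs)


def semijoin(s1, r1, s2, r2):
    """Keep the tuples of r1 that agree with some tuple of r2 on shared attributes."""
    shared = [a for a in s1 if a in s2]
    keys = {_key(s2, t, shared) for t in r2}
    return {t for t in r1 if _key(s1, t, shared) in keys}


def join(s1, r1, s2, r2):
    """Natural join (nested loops); returns the new schema and relation."""
    shared = [a for a in s1 if a in s2]
    extra = [a for a in s2 if a not in s1]
    out = {t1 + _key(s2, t2, extra) for t1 in r1 for t2 in r2
           if _key(s1, t1, shared) == _key(s2, t2, shared)}
    return s1 + tuple(extra), out


def yannakakis(parent, schemas, rels):
    """parent: node -> parent (root -> None); assumed to list parents before children."""
    schemas, rels = dict(schemas), dict(rels)
    top_down = list(parent)
    bottom_up = top_down[::-1]
    for f in bottom_up:                       # lines 1-4: upward semijoins
        p = parent[f]
        if p is not None:
            rels[p] = semijoin(schemas[p], rels[p], schemas[f], rels[f])
    for f in top_down:                        # lines 5-8: downward semijoins
        p = parent[f]
        if p is not None:
            rels[f] = semijoin(schemas[f], rels[f], schemas[p], rels[p])
    for f in bottom_up:                       # lines 9-12: join children into parents
        p = parent[f]
        if p is not None:
            schemas[p], rels[p] = join(schemas[p], rels[p], schemas[f], rels[f])
    root = top_down[0]
    return schemas[root], rels[root]


# Hypothetical two-node join tree for R(A, B) joined with S(B, C).
parent = {"R": None, "S": "R"}
schemas = {"R": ("A", "B"), "S": ("B", "C")}
rels = {"R": {(1, 1), (2, 2), (3, 9)}, "S": {(1, 5), (2, 6), (7, 7)}}
out_schema, out = yannakakis(parent, schemas, rels)
print(out_schema, sorted(out))
```

The two semijoin passes remove dangling tuples such as (3, 9) and (7, 7) before any join is materialized, which is what keeps intermediate results bounded by the output.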


A path from $A \in \mathcal{V}_{\mathcal{H}}$ to $B \in \mathcal{V}_{\mathcal{H}}$ in a hypergraph is a sequence of attributes, starting with A
and ending with B, such that each consecutive pair of attributes in the sequence occur together
in a hyperedge. The number of attributes in the sequence is the length of the path.

We now define a GHD of a hypergraph. 

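► Definition 2. Given a hypergraph $\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, a generalized hypertree decomposition is a
pair $(T, \chi)$ of a tree $T = (V_T, E_T)$ and function $\chi : V_T \to 2^{\mathcal{V}_{\mathcal{H}}}$ such that

■ For each relation $F \in \mathcal{E}_{\mathcal{H}}$, there exists a tree node $t \in V_T$ that covers the edge, i.e. $F \subseteq \chi(t)$.
■ For each attribute $A \in \mathcal{V}_{\mathcal{H}}$, the tree nodes containing A, i.e. $\{t \in V_T \mid A \in \chi(t)\}$, form a
connected subtree.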
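The two conditions of Definition 2 can be checked mechanically. The sketch below is one possible check, not the paper's code: it assumes a rooted tree given as a hypothetical `parent` dictionary (root mapped to `None`) and bags given as a dictionary from node to attribute set. It relies on the fact that, in a rooted tree, a set of nodes is connected exactly when one of its nodes has its parent outside the set.

```python
def is_valid_ghd(edges, parent, bags):
    """edges: iterable of frozensets; parent: node -> parent or None; bags: node -> set."""
    edges = list(edges)
    # Condition 1: every hyperedge is covered by some bag.
    if not all(any(e <= bags[t] for t in bags) for e in edges):
        return False
    # Condition 2 (running intersection): nodes containing A form a connected subtree.
    for a in set().union(*edges):
        nodes = {t for t in bags if a in bags[t]}
        # Each connected piece has exactly one top-most node, so count those.
        tops = [t for t in nodes if parent[t] is None or parent[t] not in nodes]
        if len(tops) != 1:
            return False
    return True


# Hypothetical triangle query with edges {A,B}, {B,C}, {A,C}.
triangle = [frozenset("AB"), frozenset("BC"), frozenset("AC")]
single_bag = is_valid_ghd(triangle, {"t0": None}, {"t0": {"A", "B", "C"}})
broken_chain = is_valid_ghd(
    triangle,
    {"t0": None, "t1": "t0", "t2": "t1"},
    {"t0": {"A", "B"}, "t1": {"B", "C"}, "t2": {"A", "C"}},
)
print(single_bag, broken_chain)
```

The chain fails because A appears in bags t0 and t2 but not in t1, so the nodes containing A are not connected.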


The latter condition is called the “running intersection property”. The $\chi(t)$ sets are referred
to as ‘bags’ of the GHD. GHDs are assumed to be ‘rooted’ trees, which imposes a top-down
partial order on their nodes. Leveraging this order, for any GHD $(T, \chi)$ and attribute $A \in \mathcal{V}_{\mathcal{H}}$,
we define $\mathrm{TOP}_T(A)$ to be the top-most node $t \in V_T$ such that $A \in \chi(t)$.

When each bag of a GHD consists of the attributes of a single relation, the GHD is also called
a join tree. Joins over a join tree can be processed using Yannakakis’ algorithm (pseudo-code
in Algorithm 1). The runtime of Yannakakis’ algorithm is $O(\mathrm{IN} + \mathrm{OUT})$.

GHDs can be interpreted as query plans for joins. Given a GHD, we first join the attributes
in each bag using worst-case optimal algorithms to get one intermediate relation per
bag. The intermediate relations can then be joined using Yannakakis’ algorithm. This combined
algorithm is called GHDJoin; the pseudo-code for GHDJoin is given in Appendix A.

The runtime of GHDJoin can be expressed in terms of the fractional hypertree width of the
GHD:


► Definition 3. Given a hypergraph $\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$ and a GHD $(T, \chi)$, the fractional hypertree
width, denoted $fhw(T, \mathcal{H})$, is defined to be $\max_{t \in V_T} \rho^*_t$, in which $\rho^*_t$ is the optimal value of the
following linear program defined for each $t \in V_T$:

$$\text{Minimize } \sum_{F \in \mathcal{E}_{\mathcal{H}}} x_F \log_{\mathrm{IN}}(|R_F|) \quad \text{such that} \quad \forall A \in \chi(t) : \sum_{F : A \in F} x_F \ge 1, \qquad \forall F \in \mathcal{E}_{\mathcal{H}} : x_F \ge 0$$

The fractional hypertree width is just the AGM bound placed on the bags. Thus $\mathrm{IN}^{fhw(T, \mathcal{H})}$
is an upper bound on the sizes of the intermediate relations of GHDJoin.
GHDJoin runs in time $\tilde{O}(\mathrm{IN}^{fhw(T, \mathcal{H})} + \mathrm{OUT})$ for join queries.

⁵ Traditionally GHDs are defined as a triple $(T, \chi, \lambda)$, where the function $\lambda : V_T \to 2^{\mathcal{E}_{\mathcal{H}}}$ assigns relations
to each bag. Here we omit this function and implicitly assign every relation to each bag (so $\lambda(t) = \mathcal{E}_{\mathcal{H}}$
for all $t \in V_T$). Though this makes a difference for certain notions of width, it leaves the fractional
hypertree width unchanged, as adding more relations to the linear program will never make the objective
value worse.
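As a worked instance of the linear program in Definition 3, consider the standard triangle query with a single bag. The relation names R, S, T and the common size N are assumptions chosen for illustration; they are not taken from the paper.

```latex
% Hypothetical triangle query: R(A,B), S(B,C), T(A,C), each of size N,
% and a GHD with one bag \chi(t) = \{A, B, C\}.
\begin{align*}
\text{Minimize } & (x_R + x_S + x_T)\,\log_{\mathrm{IN}} N \\
\text{such that } & x_R + x_T \ge 1 \quad (A), \qquad
                    x_R + x_S \ge 1 \quad (B), \qquad
                    x_S + x_T \ge 1 \quad (C), \\
                  & x_R,\, x_S,\, x_T \ge 0 .
\end{align*}
% Adding the three covering constraints gives 2(x_R + x_S + x_T) \ge 3,
% and x_R = x_S = x_T = 1/2 attains this bound, so
\[
  \rho^*_t = \tfrac{3}{2}\,\log_{\mathrm{IN}} N ,
  \qquad
  \mathrm{IN}^{\rho^*_t} = N^{3/2} ,
\]
% which is the AGM bound on the size of the triangle join.
```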


Annotated Relations 

To define a general notion of aggregations, we look to relations annotated with semirings [11].

► Definition 4. A commutative semiring is a triple $(S, \oplus, \otimes)$ of a set S and operators $\oplus : S \times S \to S$,
$\otimes : S \times S \to S$ where there exist $0, 1 \in S$ such that for all $a, b, c \in S$ the following properties
hold:

■ Identity and Annihilation: $a \oplus 0 = a$, $a \otimes 1 = a$, $0 \otimes a = 0$
■ Associativity: $(a \oplus b) \oplus c = a \oplus (b \oplus c)$, $(a \otimes b) \otimes c = a \otimes (b \otimes c)$
■ Commutativity: $a \oplus b = b \oplus a$, $a \otimes b = b \otimes a$
■ Distributivity: $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$

Suppose we have some domain K and an operator set $O = \{\oplus^1, \oplus^2, \ldots, \oplus^k, \otimes\}$ such that 0 is
the identity for each $\oplus^i \in O$ and $(K, \oplus^i, \otimes)$ forms a commutative semiring for each i. We then
define a relation with an annotation from K for each tuple.
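The requirement that several operators share the identity 0 and the same product can be sanity-checked on a small sample. The sketch below is an assumption-laden illustration: the function `check_semiring` and the sample of small non-negative integers are made up for this purpose, and passing the check on a finite sample is evidence, not a proof, that the axioms of Definition 4 hold.

```python
import operator
from itertools import product


def check_semiring(sample, add, mul, zero, one):
    """Check the commutative-semiring axioms on every triple drawn from a finite sample."""
    for a, b, c in product(sample, repeat=3):
        ok = (
            add(a, zero) == a and mul(a, one) == a and mul(zero, a) == zero
            and add(add(a, b), c) == add(a, add(b, c))
            and mul(mul(a, b), c) == mul(a, mul(b, c))
            and add(a, b) == add(b, a) and mul(a, b) == mul(b, a)
            and mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
        )
        if not ok:
            return False
    return True


sample = range(0, 6)  # non-negative integers, so that 0 is an identity for max
sum_product = check_semiring(sample, operator.add, operator.mul, 0, 1)
max_product = check_semiring(sample, max, operator.mul, 0, 1)
min_product = check_semiring(sample, min, operator.mul, 0, 1)
print(sum_product, max_product, min_product)
```

On this domain, sum and max can share one operator set with multiplication, while min cannot, because 0 is not an identity for min.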


► Definition 5. An annotated relation with annotations from K, or a K-relation, over attribute
set F is a set $\{(t_1, \lambda_1), (t_2, \lambda_2), \ldots, (t_N, \lambda_N)\}$ such that for all $1 \le i \le N$, $t_i \in \mathcal{D}^F$, $\lambda_i \in K$ and
for all $1 \le i, j \le N$ : $i \ne j \Rightarrow t_i \ne t_j$.

Green et al. define a K-relation to be a function $R_F : \mathcal{D}^F \to K$ [11]. Our notion can be
viewed as an explicit listing of this function’s support. Note that unlike an explicit listing of
the function’s support, our table does allow tuples with 0 annotations. However, under our
definitions of the operators below, an annotation of 0 is semantically equivalent to a tuple being
absent (we discuss this further later). Note that we can have an annotated relation over the
empty attribute set, of size 1, containing the empty tuple with some annotation. We now define
joins and aggregations over annotated relations.


Joins over Annotated Relations 

Informally, a join over annotated relations is obtained as follows: (i) We perform a regular join
on the non-annotated part of the relations. (ii) For each output tuple t of the join, we set its
annotation to the product of the annotations of the input tuples used to produce t. We define a
join $\bowtie_{F \in \mathcal{E}} R_F$ as:

$$\bowtie_{F \in \mathcal{E}} R_F = \Big\{ (t, \lambda) : \lambda = \bigotimes_{F \in \mathcal{E}} \lambda_F \text{ in which } (\pi_F(t), \lambda_F) \in R_F \Big\}$$

Figure 2: Selected examples illustrating the operators over the semiring $(\mathbb{R}^+, +, \cdot)$.


Aggregations over Annotated Relations 

An aggregation over an annotated relation $R_F$ is specified by a pair $(A, \oplus)$ where $A \in F$ and
$\oplus \in O$. The aggregation takes groups of tuples in $R_F$ that share values of all attributes other
than A, and produces a single tuple corresponding to each group, whose annotation is the
$\oplus$-aggregate of the annotations of the tuples in the group. Suppose that R has schema R(A, B) in
which A is a single attribute and B is a set of attributes. Then, the result of aggregation $(A, \oplus)$
has only the attributes B and

$$\sum_{(A, \oplus)} R_{A,B} = \Big\{ (t_B, \lambda) : t_B \in \pi_B R \text{ and } \lambda = \bigoplus_{(t, \lambda_t) \in R \,:\, \pi_B t = t_B} \lambda_t \Big\}$$

One can define the meaning of aggregate queries in a straightforward way: first compute the
join and then perform aggregations. Figure 2 shows some examples of operators on relations. For
the remainder of our work, we assume that all relations are K-relations.
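The "join first, then aggregate" semantics can be sketched as follows. The representation (a schema tuple plus a dictionary from tuples to annotations) and the names `annotated_join` and `aggregate` are assumptions for illustration; the example uses the semiring $(\mathbb{R}^+, +, \cdot)$ with made-up data.

```python
def annotated_join(schema_r, R, schema_s, S, mul=lambda x, y: x * y):
    """Join two K-relations; the annotation of an output tuple is the product of its inputs."""
    shared = [a for a in schema_r if a in schema_s]
    extra = [a for a in schema_s if a not in schema_r]
    out = {}
    for t_r, z_r in R.items():
        row_r = dict(zip(schema_r, t_r))
        for t_s, z_s in S.items():
            row_s = dict(zip(schema_s, t_s))
            if all(row_r[a] == row_s[a] for a in shared):
                out[t_r + tuple(row_s[a] for a in extra)] = mul(z_r, z_s)
    return tuple(schema_r) + tuple(extra), out


def aggregate(schema, R, attr, op):
    """Aggregate attribute attr out of K-relation R with the binary operator op."""
    i = schema.index(attr)
    out = {}
    for t, z in R.items():
        key = t[:i] + t[i + 1:]                      # group by all other attributes
        out[key] = op(out[key], z) if key in out else z
    return schema[:i] + schema[i + 1:], out


# Hypothetical K-relations R(A, B) and S(B, C) with non-negative real annotations.
R = {(1, 1): 2.0, (1, 2): 3.0}
S = {(1, 5): 4.0, (2, 5): 0.5}
js, joined = annotated_join(("A", "B"), R, ("B", "C"), S)
summed = aggregate(js, joined, "B", lambda x, y: x + y)
maxed = aggregate(js, joined, "B", max)
print(joined, summed, maxed)
```

Here the join yields annotations 8.0 and 1.5, so summing out B gives 9.5 and taking the max over B gives 8.0 for the single group (A=1, C=5).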

3.2 The AJAR problem 

► Definition 6. Given some global attribute set $\mathcal{V}$ and operator set O, we define an aggregation
ordering to be a sequence $\alpha = \alpha_1, \alpha_2, \ldots, \alpha_s$ such that for each $1 \le i \le s$, $\alpha_i = (a_i, \oplus_i)$ for some
$a_i \in \mathcal{V}$, $\oplus_i \in O$. In addition, attributes occur at most once, i.e., $a_j \ne a_k$ for each $1 \le j < k \le s$.

Informally, the aggregation ordering is just a sequence of attribute-operator pairs such that
each attribute in the sequence occurs at most once. Note that the aggregation ordering specifies
the order and manner in which attributes are aggregated. The ordering does not need to contain
every attribute; we use the term output attributes to denote the attributes not in the ordering.

$V(\alpha)$ represents the set of attributes that appear in $\alpha$, and $V(-\alpha)$ represents $\mathcal{V} \setminus V(\alpha)$ (i.e.
the output attributes). When $F \subseteq V(\alpha)$, we use $\alpha_F$ to represent a sequence $\beta$ that is equivalent
to $\alpha$ restricted to the attributes in F, i.e. $V(\beta) = F$, and any $(A, \oplus), (B, \oplus') \in \alpha$ such that
$A, B \in F$ must also appear in $\beta$ with their order preserved.

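► Definition 7 (Ajar). Given some hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and an aggregation ordering $\alpha$, an
Ajar query $Q_{\mathcal{H},\alpha}$ is a function over instances of $\mathcal{H}$ such that

$$Q_{\mathcal{H},\alpha}(\{R_F \mid F \in \mathcal{E}\}) = \sum_{\alpha_1} \cdots \sum_{\alpha_{|\alpha|}} \bowtie_{F \in \mathcal{E}} R_F .$$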

For an Ajar query, we define OUT to be the final output size, rather than the output size 
of the join. There are two technical challenges when it comes to solving an Ajar query: 


■ Multiple aggregation orders can give the same output over any database instance, and using
some aggregation orders may give faster runtimes than others, e.g. some orders may allow
early aggregation. Thus we need to identify which orders are equivalent to the given order
and which order leads to the smallest runtime.

■ OUT for an Ajar query with $|\alpha| > 0$ is smaller than the output size of the join part of
the query. Thus the standard GHDJoin runtime of $\tilde{O}(\mathrm{IN}^{fhw} + \mathrm{OUT})$ is harder to achieve for
Ajar queries. Naively applying a variant of GHDJoin that performs aggregations (given in
Appendix A) to Ajar leads to a higher runtime of $\tilde{O}(\mathrm{IN}^{fhw} \cdot \mathrm{OUT})$ (see Appendix A). Thus we
need to identify which GHDs can be used for efficient processing of Ajar queries.

We handle these technical challenges in turn.

⁶ Note that, by this definition, the operators in an aggregation ordering can be the product aggregation $\otimes$.
However, product aggregations require different definitions; see the later section on product aggregations.

3.3 Equivalent Orderings 

Distinct aggregation orders can be equivalent in that they produce the same output on every
instance. For example, suppose $\alpha = ((A,+), (B,+))$ and $\beta = ((B,+), (A,+))$, where A, B are
two attributes in some $\mathcal{H}$. Then two Ajar queries with orderings $\alpha$ and $\beta$ clearly produce the
same output for any instance I over $\mathcal{H}$. This is because we can obtain $\beta$ from $\alpha$ by switching
the positions of two adjacent aggregations with the same aggregation operator. Similarly, if $\mathcal{H}$
consists only of relations {A, B}, {B, C}, then the orderings $\alpha = ((A,+), (C,\max))$ and
$\beta = ((C,\max), (A,+))$ are equivalent, since we can independently aggregate the two attributes away
before joining the two relations on B. We now formally define equivalent orderings.

► Definition 8 (Equivalent Orderings). Given a hypergraph $\mathcal{H}$, define the equivalence relation
between orderings $\equiv_{\mathcal{H}}$ such that $\alpha \equiv_{\mathcal{H}} \beta$ if and only if $Q_{\mathcal{H},\alpha}(I) = Q_{\mathcal{H},\beta}(I)$ for all database
instances I over the schema $\mathcal{H}$.

We say that two operators $\oplus, \oplus'$ are distinct over a domain K (denoted by $\oplus \ne \oplus'$) if
$\exists x, y \in K : x \oplus y \ne x \oplus' y$. And $\oplus = \oplus'$ means that $\forall x, y \in K$, $x \oplus y = x \oplus' y$. Of course, distinct
operators do not, in general, commute.

We now state a theorem specifying two conditions under which aggregations can commute. 
We will later show these conditions to be complete. 

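► Theorem 9. Suppose we are given a relation $R_F$ such that $A, B \in F$ and two operators
$\oplus, \oplus' \in O$. Then

$$\sum_{(A,\oplus)} \sum_{(B,\oplus')} R_F = \sum_{(B,\oplus')} \sum_{(A,\oplus)} R_F$$

if one of the following conditions holds:

■ $\oplus = \oplus'$
■ There exist relations $R_{F_1}$ and $R_{F_2}$ such that $A \notin F_1$, $B \notin F_2$, and $R_{F_1} \bowtie R_{F_2} = R_F$.

Proof. The first condition follows trivially from the commutativity of our operators. The second
condition follows from the fact that we can “push down” aggregations:

$$\sum_{(A,\oplus)} \sum_{(B,\oplus')} R_{F_1} \bowtie R_{F_2} = \Big(\sum_{(B,\oplus')} R_{F_1}\Big) \bowtie \Big(\sum_{(A,\oplus)} R_{F_2}\Big) = \sum_{(B,\oplus')} \sum_{(A,\oplus)} R_{F_1} \bowtie R_{F_2}$$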

These two conditions give us a simple procedure for testing when an ordering $\beta$ is equivalent
to the given $\alpha$. Algorithm 2 gives the procedure’s pseudo-code. To avoid triviality, we assume $\alpha$
and $\beta$ have the same set of attributes and assign the same operator to the same attributes. First
we return true if both $\alpha$ and $\beta$ are empty. Then we check if $\alpha$ can be shown to be equivalent to
$\beta$ using the conditions from Theorem 9. This procedure is both sound and complete:

Algorithm 2 TestEquivalence($\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, $\alpha$, $\beta$)
Input: Query hypergraph $\mathcal{H}$, orderings $\alpha$, $\beta$.
Output: True if $\alpha \equiv_{\mathcal{H}} \beta$, False otherwise.
if $|\alpha| = |\beta| = 0$ then
    return True
end if
Remove $V(-\alpha)$ from $\mathcal{H}$, then divide $\mathcal{H}$ into connected components $C_1, \ldots, C_m$.
if $m > 1$ then
    return $\bigwedge_i$ TestEquivalence($\mathcal{H}$, $\alpha_{C_i}$, $\beta_{C_i}$)
end if
Choose j such that $\beta_j = \alpha_1$. Let $\beta_j = (b_j, \oplus)$.
if $\exists i < j$ : $\beta_i = (b_i, \oplus')$, $\oplus' \ne \oplus$ and there is a path from $b_i$ to $b_j$ in $\{b_i, b_{i+1}, \ldots, b_{|\alpha|}\}$ then
    return False
end if
Let $\beta'$ be $\beta$ with $\beta_j$ removed.
Let $\alpha'$ be $\alpha$ with $\alpha_1$ removed.
return TestEquivalence($\mathcal{H}$, $\alpha'$, $\beta'$)


► Lemma 10. Algorithm 2 returns True iff $\alpha \equiv_{\mathcal{H}} \beta$.

We omit this lemma’s proof because it is very similar to and implied by the proofs required
in Section 4.

To answer Ajar queries, we need one more component in addition to Algorithm 2, namely
AggroGHDJoin, a straightforward variant of GHDJoin that performs aggregations (pseudo-code
in Appendix A). The first step of AggroGHDJoin is similar to that of GHDJoin, namely performing
joins within each bag of the GHD to get intermediate relations. We need to do some extra
work to ensure that each annotation is multiplied only once, since a relation may be joined in
multiple bags. After that, instead of calling Yannakakis’ algorithm on the intermediate relations,
AggroGHDJoin calls AggroYannakakis (pseudo-code in Appendix A), a well-known variant of
Yannakakis that performs aggregations. AggroYannakakis initially performs semijoins like
Yannakakis (lines 1-8 in Algorithm 1). But in the bottom-up join phase (line 11), AggroYannakakis
aggregates out all attributes that have F as their TOP node, before joining $R_F$ with $R_P$.

Armed with Algorithm 2 and AggroGHDJoin, we have a simple way to answer an Ajar query
$Q_{\mathcal{H},\alpha}$. We search through all orders, running Algorithm 2 to check for equivalence with $\alpha$. For
each order $\beta$ such that $\beta \equiv_{\mathcal{H}} \alpha$, we search through all GHDs and check if they are compatible
with $\beta$. A GHD T is defined to be compatible with an ordering $\beta$ if, for all attribute pairs A, B,
$\mathrm{TOP}_T(A)$ being an ancestor of $\mathrm{TOP}_T(B)$ implies that either A is an output variable or A occurs
before B in $\beta$ (note this precludes B from being an output variable). We can run AggroGHDJoin
on any compatible GHD to answer the Ajar query. The runtime of AggroGHDJoin on a
compatible GHD $(T, \chi)$ is given by $\tilde{O}(\mathrm{IN}^{fhw(T, \mathcal{H})} + \mathrm{OUT})$. We choose the compatible GHD that
has the smallest fhw, and use it to answer the query. The theorem below states our runtime:

► Theorem 11. Given an Ajar query $Q_{\mathcal{H},\alpha}$, let $w^*$ denote the smallest fhw for
a GHD compatible with an ordering $\beta \equiv_{\mathcal{H}} \alpha$; the runtime of our approach
is $\tilde{O}(\mathrm{IN}^{w^*} + \mathrm{OUT})$.
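The compatibility condition used in this search lends itself to a direct check. The following Python sketch is illustrative only: the tree encoding (a `parent` map and a `bags` map), all function names, and the reading of "ancestor" as strict ancestor are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set


def top_nodes(parent: Dict[str, Optional[str]], bags: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each attribute to its TOP node: the bag containing it that is closest to the root."""
    def depth(node: str) -> int:
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    top: Dict[str, str] = {}
    for node, attrs in bags.items():
        for a in attrs:
            if a not in top or depth(node) < depth(top[a]):
                top[a] = node
    return top


def is_strict_ancestor(parent: Dict[str, Optional[str]], u: str, v: str) -> bool:
    """True if tree node u lies strictly above tree node v (assumed reading of 'ancestor')."""
    v = parent[v]
    while v is not None:
        if v == u:
            return True
        v = parent[v]
    return False


def is_compatible(parent, bags, ordering: List[str], outputs: Set[str]) -> bool:
    """Check the compatibility condition for every attribute pair (A, B).

    `ordering` lists the aggregated (non-output) attributes in the order of beta.
    """
    top = top_nodes(parent, bags)
    position = {a: i for i, a in enumerate(ordering)}
    for a in top:
        for b in top:
            if a != b and is_strict_ancestor(parent, top[a], top[b]):
                if a in outputs:
                    continue  # output attributes may sit above anything
                # A is aggregated: B must be aggregated too and come after A in beta
                if b in outputs or position[a] > position[b]:
                    return False
    return True
```

For a two-bag GHD with root bag {A, B} and child bag {B, C}, output attribute A, and ordering (B, C), the check succeeds; swapping the two bags places the output attribute A below B and the check fails.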

Comparison to Prior Work 

Work by Olteanu and Zavodny focuses on a special case of Ajar queries, having a single
aggregation operator. For these queries, they have a similar algorithm that iterates over
GHDs to find the best compatible one. Their algorithm achieves the same runtime as ours,
but cannot handle queries with more than one type of aggregation operator. The FAQ paper
uses an algorithm called InsideOut to answer general Ajar queries. The running time of
InsideOut equals $\tilde{O}(\mathrm{IN}^{faqw})$, where faqw (FAQ-width) is a new notion of width
defined by the FAQ authors [15, Section 9.1]. Our algorithm has runtime that is no worse than
InsideOut ($w^* \le faqw$, $\mathrm{OUT} \le O(\mathrm{IN}^{faqw})$) and can be much better when
output attributes are present.

► Theorem 12. For any Ajar query, $w^* \le faqw$ and $\mathrm{OUT} \le O(\mathrm{IN}^{faqw})$.

This theorem is proved in Appendix B.1. Notice that the InsideOut runtime is not
output-sensitive, i.e. it does not have a + OUT term. As a result the runtime can be very high when
the output is small relative to the number of output attributes; this is demonstrated by an example
in the appendix. FAQ does have a high-level discussion of approaches to make InsideOut
output-sensitive [15, Section 10.2]; indeed, simply using GHDJoin instead of their bespoke algorithm
can achieve output-sensitive bounds, which we discuss in Appendix B.

Discussion 

We presented a remarkably simple procedure for solving Ajar queries. The procedure involves a
brute force search over different orderings and GHDs, but this is usually unavoidable as finding
the best ordering and GHD is NP-Hard. Deciding if an ordering is equivalent to the given
ordering is enabled by Algorithm 1, which takes time polynomial in the number of attributes.
Determining if a GHD is compatible with an ordering is straightforward as well. Once the
best GHD is found, we use well known, standard algorithms like AggroGHDJoin to answer the
query efficiently. The resulting runtime exponents are smaller than those of previous work. The
simplicity of the algorithm makes it easy to implement; we have already implemented a special
case of a single additive operator $\oplus$ in our engine.

The equivalence/compatibility tests raise the technically interesting question of finding
succinct characterizations of:

- All orderings equivalent to any given $\alpha$.

- All GHDs that are compatible with at least one of the equivalent orderings.

We answer the first question in Section 4 by providing a simple characterization of all equivalent
orderings, and the second question in Section 5 by defining 'valid' GHDs and characterizing their
structure in relation to unrestricted GHDs.

4 Characterizing Equivalent Orderings


We described a procedure for determining when two orderings are equivalent. The equivalence
relation $\equiv_{\mathcal{H}}$ defines equivalence classes among the orderings, but these classes
may be exponential in size; we find a more succinct characterization that lets us enumerate all
equivalent orderings. Chen and Dalmau obtained a similar order-equivalence characterization for a
special case of the Ajar problem, namely for aggregations "and" and "or". The characterization was
based on a procedure that generated all equivalent orderings. We improve on this result by providing
a simple and succinct characterization of the equivalence class of an aggregation ordering with any
number of operators.

To that end, we develop an enumeration of the constraints that are sufficient and necessary 
for an ordering to be in the equivalence class of a. The constraints are of the form “A must 
always occur before B”: 


► Definition 13 (PREC). Given an Ajar query $Q_{\mathcal{H},\alpha}$, define a constraint
PREC $\subseteq \mathcal{V} \times \mathcal{V}$ such that $(A, B) \in$ PREC if and only if $A$
precedes $B$ in all orderings that are equivalent to $\alpha$.
We say PREC(A, B) is true if and only if $(A, B) \in$ PREC.

Trivially, the number of pairs in PREC is less than $n^2$. We note that we can use PREC
to define a (strict) partial ordering on the attributes; the constraints are clearly antireflexive,
antisymmetric, and transitive. We use $<_{\mathcal{H},\alpha}$ to denote this partial order. Given
an Ajar query $Q_{\mathcal{H},\alpha}$, $<_{\mathcal{H},\alpha}$ is a partial order of
attribute-operator pairs such that for any $(A, \oplus), (B, \oplus') \in \alpha$,
$(A, \oplus) <_{\mathcal{H},\alpha} (B, \oplus')$ if PREC(A, B) (see Definition 22 for the exact
definition). The partial order $<_{\mathcal{H},\alpha}$ is easier to use for proofs; we use the
partial order to show the soundness and completeness of these constraints.


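► Theorem 14 (Soundness and Completeness of $<_{\mathcal{H},\alpha}$). Suppose we are given a
hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and aggregation orderings $\alpha, \beta$.
Then $\alpha \equiv_{\mathcal{H}} \beta$ if and only if $\beta$ is a linear extension of
$<_{\mathcal{H},\alpha}$.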
We first describe a procedure to compute the precedence relation PREC. After that, we reason 
about its completeness. 


Computing PREC 

To assist in building PREC, we define a constraint of the form "A and B cannot commute":

► Definition 15 (DNC). Given an Ajar query $Q_{\mathcal{H},\alpha}$, define a constraint
DNC $\subseteq \mathcal{V} \times \mathcal{V}$ such that $(A, B) \in$ DNC if and only if $A$ and
$B$ are in the same order in any $\beta$ such that $\beta \equiv_{\mathcal{H}} \alpha$.

Once again, we say DNC(A, B) is true if and only if $(A, B) \in$ DNC. We prefer to work with
DNC because we have already discussed when aggregations can commute in the commuting theorem
stated earlier; the conditions of that theorem specify when DNC is FALSE. However, we can
immediately derive a simple relationship between PREC and DNC:

► Lemma 16. Given an Ajar query $Q_{\mathcal{H},\alpha}$, for any $A, B \in \mathcal{V}$,
PREC(A, B) iff DNC(A, B) and $A$ precedes $B$ in $\alpha$.

We now develop conditions when DNC is true. Recall that the commuting theorem states that two
aggregations can commute if (1) they have the same operator or (2) if they can be separated in the
join query; the simplest structure that violates both of these conditions is an edge that contains
two attributes with differing aggregating operators.

► Lemma 17. Given an Ajar query $Q_{\mathcal{H},\alpha}$, suppose
$(A, \oplus), (B, \oplus') \in \alpha$. If $\oplus \neq \oplus'$ and there exists an edge
$E \in \mathcal{E}$ such that $A, B \in E$, then DNC(A, B).

Lemma 17 serves as a base case, but we want to extend the violation of the commuting theorem's
conditions beyond single edges to paths. To do so, consider the following examples of how our
commuting conditions interact with paths of length two.


► Example 18. Consider the query

$\sum_A \max_B \max_C R(A, B) \bowtie S(B, C)$, hence $\alpha = (A, B, C)$.

No two attributes can be separated, which implies DNC(A, B) and DNC(A, C). Lemma 17 gives
us the former constraint, but not the latter one. This example indicates that it may be possible
to extend a constraint DNC(A, B) along an edge {B, C}. On the other hand, consider the query

$\max_B \sum_A \max_C R(A, B) \bowtie S(B, C)$, so $\alpha = (B, A, C)$.

Note that A and C can be separated, which implies that only DNC(A, B) holds. Note that, as
before, Lemma 17 gives us this constraint. This example suggests that we cannot extend every
DNC(A, B) constraint along an additional edge.

The key difference between the two examples is the relative order of A and B in $\alpha$, which
suggests that we can only extend DNC(A, B) along an edge if A precedes B in $\alpha$, i.e. if
PREC(A, B).

► Lemma 19. Given an Ajar query $Q_{\mathcal{H},\alpha}$, suppose
$(A, \oplus), (B, \oplus') \in \alpha$. If $\oplus \neq \oplus'$ and
$\exists C \in \mathcal{V}, E \in \mathcal{E}$ : PREC(A, C) and $B, C \in E$, then DNC(A, B).

PREC is transitive, which implies: 

► Lemma 20. Given an Ajar query $Q_{\mathcal{H},\alpha}$, suppose
$(A, \oplus), (B, \oplus') \in \alpha$. If $\exists C$ : PREC(A, C) and PREC(C, B),
then DNC(A, B).

The above transitivity condition interacts with the condition from Lemma 19 in interesting
ways.

► Example 21. Consider the query with $\alpha = (A, B, C, D)$,

$\sum_A \max_B \max_C \sum_D R(A, B) \bowtie S(B, D) \bowtie T(C, D)$.

No attributes can be separated, which implies DNC(A, B), DNC(A, C), DNC(B, D), and DNC(C, D).
Transitivity gives DNC(A, D) as well. Now let us derive these constraints using Lemmas 17, 19,
and 20. Lemma 17 gives us DNC(A, B), DNC(B, D), and DNC(C, D). Note that at this point,
Lemma 19 gives us no more constraints. Only after the transitivity of Lemma 20 adds the
constraint DNC(A, D) can Lemma 19 add the constraint DNC(A, C), completing the set of
constraints.


It turns out that these three relatively simple lemmas are the sufficient and necessary
constraints on the equivalence classes of orderings; no other conditions are necessary to complete
the proof of the soundness and completeness of $<_{\mathcal{H},\alpha}$.

We note that our current specifications of PREC and DNC are mutually recursive. The PREC
and DNC sets build up in rounds; Lemma 17 provides their initial values, and Lemmas 16, 19,
and 20 iteratively build up the sets further. We keep applying these lemmas until the sets reach a
fixed point. This takes at most $2|\alpha|^2$ iterations, as we must add at least one additional
attribute pair per iteration, and there can be only $|\alpha|^2$ pairs of attributes in each set.
Thus the overall runtime of computing these constraints is polynomial in the number of attributes.
We detail this process in Appendix C.
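The fixed-point computation described above can be sketched in Python as follows. This is a minimal illustration, not the procedure from Appendix C: the function name `compute_prec`, the representation of edges as sets and of the ordering as a list of (attribute, operator-name) pairs, and the choice to store DNC as unordered pairs are all assumptions made for this example.

```python
from itertools import permutations
from typing import FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]


def compute_prec(edges: List[Set[str]], ordering: List[Tuple[str, str]]) -> Set[Pair]:
    """Apply Lemmas 16, 17, 19 and 20 until nothing changes; return PREC as ordered pairs."""
    op = dict(ordering)                                   # attribute -> operator name
    pos = {a: i for i, (a, _) in enumerate(ordering)}     # attribute -> position in alpha
    attrs = list(op)

    def share_edge(a: str, b: str) -> bool:
        return any(a in e and b in e for e in edges)

    dnc: Set[FrozenSet[str]] = set()   # DNC is symmetric, so unordered pairs suffice
    prec: Set[Pair] = set()
    changed = True
    while changed:
        changed = False
        for a, b in permutations(attrs, 2):
            if frozenset((a, b)) in dnc:
                continue
            # Lemma 17: different operators inside a common edge
            base = op[a] != op[b] and share_edge(a, b)
            # Lemma 19: extend PREC(a, c) along an edge containing b and c
            extend = op[a] != op[b] and any(
                (a, c) in prec and share_edge(b, c) for c in attrs
            )
            # Lemma 20: transitivity of PREC
            trans = any((a, c) in prec and (c, b) in prec for c in attrs)
            if base or extend or trans:
                dnc.add(frozenset((a, b)))
                # Lemma 16: PREC orients a DNC pair by its order in alpha
                prec.add((a, b) if pos[a] < pos[b] else (b, a))
                changed = True
    return prec
```

On the query of Example 21 (edges {A, B}, {B, D}, {C, D}, with sum on A and D and max on B and C), the sketch derives PREC(A, B), PREC(B, D), PREC(C, D), then PREC(A, D) by transitivity, and only then PREC(A, C), mirroring the derivation order in the example.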

For convenience of notation, we make one modification to the definition of the partial order
$<_{\mathcal{H},\alpha}$. When $A$ is an output attribute and $B$ is not, we define
$A <_{\mathcal{H},\alpha} B$ to be true. So we can formally state the definition as:

► Definition 22 ($<_{\mathcal{H},\alpha}$). Given an Ajar query $Q_{\mathcal{H},\alpha}$, we define
$A <_{\mathcal{H},\alpha} B$ to be true if either (i) $A$ is an output attribute and $B$ is not,
or (ii) PREC(A, B) is true.
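By Theorem 14, testing whether an ordering is equivalent to $\alpha$ reduces to testing whether it is a linear extension of this partial order. The Python sketch below illustrates that test on attribute names only (operators travel with their attributes, and an aggregation ordering contains no output attributes, so only the PREC pairs matter here). The function names and the brute-force enumeration are assumptions made for illustration; the enumeration is exponential and intended for small examples.

```python
from itertools import permutations
from typing import Iterable, List, Sequence, Set, Tuple


def is_linear_extension(candidate: Sequence[str], prec: Set[Tuple[str, str]]) -> bool:
    """True if every PREC(a, b) pair appears with a before b in the candidate ordering."""
    pos = {a: i for i, a in enumerate(candidate)}
    return all(pos[a] < pos[b] for a, b in prec)


def equivalent_orderings(attrs: Iterable[str], prec: Set[Tuple[str, str]]) -> List[List[str]]:
    """Enumerate all linear extensions by filtering permutations (illustration only)."""
    return [list(p) for p in permutations(attrs) if is_linear_extension(p, prec)]


# PREC pairs stated for Example 21: A first, D last, B and C free to swap.
prec_21 = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}
orderings_21 = equivalent_orderings(["A", "B", "C", "D"], prec_21)
```

For Example 21 this leaves exactly the two orderings in which B and C, which share the operator max, appear in either order between A and D.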


Soundness and Completeness of $<_{\mathcal{H},\alpha}$

To give an intuition on how we prove the soundness and completeness of $<_{\mathcal{H},\alpha}$,
we now state two key lemmas (with proofs in Appendix C) illustrating properties of
$<_{\mathcal{H},\alpha}$.

► Lemma 23. Suppose we are given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and an
aggregation ordering $\alpha$. Suppose $(A, \oplus), (B, \oplus') \in \alpha$ for differing
operators $\oplus \neq \oplus'$. Then, for any path $P$ in $\mathcal{H}$ between $A$ and $B$, there
must exist some attribute in the path $C \in P$ such that $C <_{\mathcal{H},\alpha} A$ or
$C <_{\mathcal{H},\alpha} B$.

Lemma 23 intuitively states that incomparable attributes with different operators must be
separated in $\mathcal{H}$ by their common predecessors in $<_{\mathcal{H},\alpha}$.

► Lemma 24. Given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and an aggregation
ordering $\alpha$, suppose we have two attributes $A, B \in V(\alpha)$ such that
$A <_{\mathcal{H},\alpha} B$. Then there must exist a path $P$ from $A$ to $B$ such that for every
$C \in P$, $C \neq A$, we have $A <_{\mathcal{H},\alpha} C$.

Given these two lemmas, the proof of Theorem 14 is straightforward. Lemma 23 implies
that, given an attribute ordering $\beta$ that is a linear extension of $<_{\mathcal{H},\alpha}$,
each inversion of attribute-operator pairs must either have equal operators or have attributes that
can be separated, allowing us to repeatedly use the commuting theorem to transform $\beta$ into
$\alpha$. Lemma 24 implies that, given an attribute ordering $\beta$ that is not a linear extension
of $<_{\mathcal{H},\alpha}$, we can construct a counterexample.


Discussion 


We obtained a sound and complete characterization of all orderings equivalent to any given
ordering. This result extends the work of Chen and Dalmau, who had characterized equivalent
orderings for queries with logical "and" and "or" operators. Our characterization is simple,
consisting of a partial order whose linear extensions are precisely the equivalent orderings.
FAQ [15]'s method for identifying equivalent orderings is sound but not complete. That is, there
exist equivalent orderings that the FAQ method does not identify as being equivalent (see the
example in the appendix). In contrast, our characterization is guaranteed to cover all valid
orderings. This completeness property lets us create a decomposition that is guaranteed to preserve
all node-monotone widths (see Definition 28). This in turn lets us get tighter guarantees on our
runtime exponent, using the notion of submodular width (Section 5.3).


5 Decomposing Valid GHDs

We express our Ajar algorithm directly in terms of GHDs, rather than in terms of aggregation
orderings. As such, our goal is the characterization of GHDs that are compatible with at least
one equivalent ordering, i.e. the GHDs that can be used to answer an Ajar query. We call
a GHD valid if it is compatible with at least one equivalent ordering. We first give a simple
characterization of valid GHDs. Then we demonstrate a way to reduce the problem of finding a
minimum-width valid GHD to multiple subproblems on unconstrained GHDs (Section 5.1). This
decomposition of the problem gets us three things:

- We can speed up our brute force search for an optimal valid GHD. We can also find
approximately optimal valid GHDs in polynomial time using Marx's GHD approximation
algorithm (Section 5.2).

- We can apply existing MapReduce join algorithms that utilize GHDs, obtaining efficient
parallel algorithms for solving Ajar queries (Section 5.4).

- We can apply improved join algorithms to further reduce our runtime exponent (Section 5.3).


5.1 Valid and Decomposable GHDs 

We can easily characterize valid GHDs by combining the definition of compatible GHDs with
Theorem 14.

► Theorem 25. For an Ajar query $Q_{\mathcal{H},\alpha}$, a GHD $(\mathcal{T}, \chi)$ is valid if
and only if for every pair of attributes $A, B$ such that $TOP_{\mathcal{T}}(A)$ is an ancestor of
$TOP_{\mathcal{T}}(B)$, $B \not<_{\mathcal{H},\alpha} A$.

Theorem 25 gives us a criterion specifying which GHDs can act as query plans. We now
consider the problem of finding a minimum width valid GHD for any Ajar query. We call a
GHD optimal if it has the minimum width possible for valid GHDs. We show how to reduce
the problem of finding an optimal valid GHD into smaller problems of finding ordinary optimal
GHDs. This unlocks a trove of powerful GHD results and makes them applicable to our problem.

Given an Ajar query $Q_{\mathcal{H},\alpha}$, suppose we have a subset of the nodes
$V' \subseteq \mathcal{V}$. Define $\mathcal{E}_{V'}$ to be
$\{E \in \mathcal{E} \mid E \cap V' \neq \emptyset\}$, i.e. the set of edges that intersect with
$V'$. As before, $\alpha_{V'}$ denotes the aggregation ordering restricted to the nodes in $V'$.
Additionally define ${V'}^0$ to be
$\{v \in V' \mid \forall w \in V', w \not<_{\mathcal{H},\alpha} v\}$, i.e. the nodes in $V'$ that
have no predecessors in $V'$ according to the partial ordering $<_{\mathcal{H},\alpha}$. Finally,
note that $\alpha_{V' \setminus {V'}^0}$ is then $\alpha_{V'}$ with all the nodes in ${V'}^0$
removed (note that this makes the nodes in ${V'}^0$ output attributes).

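► Definition 26. Given an Ajar query $Q_{\mathcal{H},\alpha}$, we say a GHD $(\mathcal{T}, \chi)$
is decomposable if:

- There exists a rooted subtree $\mathcal{T}_0$ of $\mathcal{T}$ such that
$\chi(\mathcal{T}_0) = V(-\alpha)$ (i.e. output attributes).

- For each connected component $C$ of $\mathcal{H} \setminus V(-\alpha)$, there is exactly one
subtree $\mathcal{T}_C \in \mathcal{T} \setminus \mathcal{T}_0$ such that $\mathcal{T}_C$ is a
decomposable GHD of $Q_{(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C), \alpha_{C \setminus C^0}}$.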

We start by connecting this idea of decomposable GHDs to valid GHDs. We only give proof
sketches here; see the appendix for the full proofs.

► Theorem 27. Every decomposable GHD is valid. 

Proof. (Sketch) Suppose the Ajar query is $Q_{\mathcal{H},\alpha}$. We need to show for any
$A, B$ such that $TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(B)$,
$A \not>_{\mathcal{H},\alpha} B$. We use induction on $|\alpha|$. If $|\alpha| = 0$, all
GHDs are valid and decomposable. For $|\alpha| > 0$, $\mathcal{T}_0$ ensures that the output
attributes are above non-output attributes. If $A$ and $B$ are non-output attributes and
$TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(B)$, then both are in some
$\mathcal{T}_C$. By the inductive hypothesis, $\mathcal{T}_C$ is valid with respect to
$Q_{(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C), \alpha_{C \setminus C^0}}$. Inspecting the
partial order created by this subgraph, we conclude that $A \not>_{\mathcal{H},\alpha} B$ as
desired. ◄

Not every valid GHD is decomposable. However, every valid GHD can be transformed
into a corresponding decomposable GHD using some simple transformations. Each bag of the
resulting decomposable GHD is a subset of one of the bags of the original GHD. Thus the fhw
of the decomposable GHD is at most the fhw of the original valid GHD. In fact, we can make a
more general claim, using a notion of node-monotone functions, defined next.

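► Definition 28. Given a hypergraph $\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$,
we define a function to be node-monotone if it is a function
$\gamma : 2^{\mathcal{V}_{\mathcal{H}}} \to \mathbb{R}$ such that
$\forall A \subseteq B \subseteq \mathcal{V}_{\mathcal{H}} : \gamma(A) \le \gamma(B)$. Given any
node-monotone function $\gamma$, we define the $\gamma$-width of a GHD $(\mathcal{T}, \chi)$ over
$\mathcal{H}$ as $\max_{v \in \mathcal{V}_{\mathcal{T}}} \gamma(\chi(v))$.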

Many standard notions of widths can be expressed as $\gamma$-widths for a suitably chosen
$\gamma$. Specifically:

► Proposition 1. Suppose we are given a hypergraph
$\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$ and database instance $I$ on
$\mathcal{H}$. Then for each of the following notions of width: (i) Treewidth (ii) Generalized
Hypertree Width (iii) Fractional Hypertree Width (iv) Submodular Width, there exists a
node-monotone function $\gamma$ such that $\gamma$-width equals the given notion of width.

As a simple example, tree-width can be expressed as $\gamma$-width for $\gamma(A) = |A| - 1$. We
can now relate valid and decomposable GHDs with respect to their $\gamma$-widths.
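To make the $\gamma$-width definition concrete, the Python sketch below evaluates it for the tree-width choice $\gamma(A) = |A| - 1$ and brute-force checks node-monotonicity on a small vertex set. The function names and the representation of a GHD as a plain list of bags are assumptions made for this illustration; the tree structure is irrelevant to the width and is therefore omitted.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Set

Gamma = Callable[[FrozenSet[str]], float]


def treewidth_gamma(attrs: FrozenSet[str]) -> float:
    """The node-monotone function gamma(A) = |A| - 1 that yields tree-width."""
    return len(attrs) - 1


def gamma_width(bags: Iterable[Set[str]], gamma: Gamma) -> float:
    """gamma-width of a GHD: the maximum of gamma over the bags chi(v)."""
    return max(gamma(frozenset(bag)) for bag in bags)


def is_node_monotone(vertices: Set[str], gamma: Gamma) -> bool:
    """Brute-force check that gamma(A) <= gamma(B) whenever A is a subset of B.

    Checking single-vertex extensions suffices, since any superset is reached
    by adding one vertex at a time. Exponential; for small examples only.
    """
    vs = list(vertices)
    for r in range(len(vs) + 1):
        for sub in combinations(vs, r):
            a = frozenset(sub)
            for v in vs:
                if v not in a and gamma(a) > gamma(a | {v}):
                    return False
    return True
```

Because a node-monotone function cannot increase when a bag shrinks, replacing every bag by a subset of some original bag can never raise the $\gamma$-width, which is the observation behind the width-preserving transformations used next.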

► Theorem 29. For every valid GHD $(\mathcal{T}, \chi)$, there exists a decomposable GHD
$(\mathcal{T}', \chi')$ such that for all node-monotone functions $\gamma$, the $\gamma$-width of
$(\mathcal{T}', \chi')$ is no larger than the $\gamma$-width of $(\mathcal{T}, \chi)$.

Proof. (Sketch) Suppose the Ajar query is $Q_{\mathcal{H},\alpha}$. We transform the given GHD
$(\mathcal{T}, \chi)$ into $(\mathcal{T}', \chi')$ such that for each
$v' \in \mathcal{V}_{\mathcal{T}'}$, there exists a $v \in \mathcal{V}_{\mathcal{T}}$ such that
$\chi'(v') \subseteq \chi(v)$. The result then follows from the node-monotonicity of $\gamma$ and
the definition of $\gamma$-width. Any transformation of a GHD that ensures that all new bags are
subsets of old bags is called width-preserving.

We then transform the GHD $(\mathcal{T}, \chi)$ to satisfy the following properties (using
width-preserving transformations):

- Every $t \in \mathcal{T}$ is $TOP_{\mathcal{T}}(A)$ for exactly one attribute $A$.

- For any node $t \in \mathcal{T}$ and the subtree $\mathcal{T}_t$ rooted at $t$, the attributes
$\{v \in \mathcal{V} \mid TOP_{\mathcal{T}}(v) \in \mathcal{T}_t\}$ form a connected subgraph of
$\mathcal{H}$.

We can show, by induction, that any valid GHD that satisfies these two properties is decomposable.
Intuitively, the first transformation ensures the subtree $\mathcal{T}_0$ exists as desired. The
second transformation ensures that each of the $\mathcal{T}_C$'s exists and satisfies the requisite
properties. ◄

This theorem lets us restrict our search to the smaller space of decomposable GHDs (instead of
all valid GHDs) when looking for the optimal valid GHD. Moreover, the space of decomposable
GHDs is simpler; it can be factored into smaller spaces of unconstrained GHDs, as we show
next. We present the definition of characteristic hypergraphs, which are intuitively the set of
hypergraphs that specify the factors, i.e. the unconstrained GHDs.

Our goal is two-fold: (1) to be able to split a decomposable GHD into component GHDs of
the characteristic hypergraphs and (2) to be able to take arbitrary GHDs of the characteristic
hypergraphs and connect them to create a decomposable GHD of the original Ajar problem. The
definition of decomposable GHDs decomposes a GHD into a series of sub-trees
$\mathcal{T}_0, \ldots, \mathcal{T}_k$. The definition specifies that the subtrees
$\mathcal{T}_1, \ldots, \mathcal{T}_k$ must be decomposable GHDs of (smaller) Ajar problems.
Additionally, it is simple to show $\mathcal{T}_0$ is a GHD of the hypergraph
$(V(-\alpha), \{E \in \mathcal{E} \mid E \subseteq V(-\alpha)\})$. If we apply this decomposition
recursively to the subtrees $\mathcal{T}_1, \ldots, \mathcal{T}_k$, we can divide any decomposable
GHD into a series of (unrestricted) GHDs of particular hypergraphs. This provides the basis of our
definition of the characteristic hypergraphs; we define a hypergraph $\mathcal{H}_0$ that specifies
the hypergraph corresponding to $\mathcal{T}_0$ and then recurse on the smaller Ajar queries
specified in that definition.

However, if we are given arbitrary GHDs of the hypergraphs as defined thus far, we may not
be able to stitch them together while preserving the running intersection property of GHDs. To
ensure this stitching is possible, we need the characteristic hypergraphs to contain additional
edges that we can use to guarantee the running intersection property. Intuitively the edges we
add will be the intersections of the adjacent subtrees in our decomposition; for example, for
any connected component $C$ of $\mathcal{H} \setminus V(-\alpha)$, $\mathcal{T}_0$ and
$\mathcal{T}_C$ are adjacent, and we will add the edge $\chi(\mathcal{T}_0) \cap \chi(\mathcal{T}_C)$
to the corresponding hypergraphs. We can use these 'intersection edges' to connect
particular nodes in the adjacent subtrees.

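► Definition 30. Given an Ajar problem $Q_{\mathcal{H},\alpha}$, suppose $C_1, \ldots, C_k$ are the
connected components of $\mathcal{H} \setminus V(-\alpha)$. Define a function
$H(\mathcal{H}, \alpha)$ that maps Ajar queries to a set of hypergraphs as follows:

- $C_i^+ = \bigcup_{E \in \mathcal{E}_{C_i}} E$ for all $1 \le i \le k$

- $\mathcal{H}_0 = (V(-\alpha), \{E \in \mathcal{E} \mid E \subseteq V(-\alpha)\} \cup \{V(-\alpha) \cap C_i^+ \mid 1 \le i \le k\})$

- $\mathcal{H}_i^+ = (C_i^+, \mathcal{E}_{C_i} \cup \{V(-\alpha) \cap C_i^+\})$

- $H(\mathcal{H}, \alpha) = \{\mathcal{H}_0\} \cup \bigcup_{1 \le i \le k} H(\mathcal{H}_i^+, \alpha_{C_i \setminus C_i^0})$

The hypergraphs in the set $H(\mathcal{H}, \alpha)$ are defined to be the characteristic hypergraphs.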
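As an illustration of Definition 30, the Python sketch below computes only the first level of the construction: the hypergraph for the output attributes and one extended hypergraph per connected component. It does not perform the recursive step, which would additionally need the partial order to find the minimal attributes of each component. The function names and the set-based hypergraph encoding are assumptions made for this example.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Hypergraph = Tuple[FrozenSet[str], Set[FrozenSet[str]]]


def components(vertices: Set[str], edges: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Connected components of the given vertices, linked through shared edges."""
    remaining = set(vertices)
    comps: List[FrozenSet[str]] = []
    while remaining:
        frontier = [remaining.pop()]
        comp = set(frontier)
        while frontier:
            v = frontier.pop()
            for e in edges:
                if v in e:
                    for w in e & remaining:   # unvisited non-output neighbours
                        remaining.discard(w)
                        comp.add(w)
                        frontier.append(w)
        comps.append(frozenset(comp))
    return comps


def first_level_characteristic(
    vertices: Iterable[str], edges: Iterable[Set[str]], outputs: Iterable[str]
) -> Tuple[Hypergraph, List[Hypergraph]]:
    """Return (H_0, [H_1^+, ..., H_k^+]) without recursing into the components."""
    edge_list = [frozenset(e) for e in edges]
    out = frozenset(outputs)
    h0_edges = {e for e in edge_list if e <= out}        # edges inside the outputs
    parts: List[Hypergraph] = []
    for comp in components(set(vertices) - out, edge_list):
        e_c = {e for e in edge_list if e & comp}         # edges touching the component
        c_plus = frozenset().union(*e_c)                 # C_i^+
        link = out & c_plus                              # the intersection edge
        h0_edges.add(link)
        parts.append((c_plus, e_c | {link}))
    return (out, h0_edges), parts
```

For a star-shaped hypergraph with output attribute A and edges {A, B1}, {A, B2}, {A, B3}, removing A leaves three singleton components; the sketch returns an output hypergraph on {A} plus three two-vertex hypergraphs, each carrying the intersection edge {A}.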

Note that the definition of characteristic hypergraphs depends only on $(\mathcal{H}, \alpha)$,
and not on a specific GHD or the instance. Now we state a key result that lets us reduce the
problem of searching for an optimal valid GHD over $\mathcal{H}$ to that of searching for (not
necessarily valid) optimal GHDs over characteristic hypergraphs. Each decomposable GHD corresponds
to a GHD over each characteristic hypergraph; conversely, a combination of GHDs for characteristic
hypergraphs gives us a decomposable GHD for $\mathcal{H}$. Formally:

► Theorem 31. For an Ajar query $Q_{\mathcal{H},\alpha}$, suppose
$\mathcal{H}_0, \ldots, \mathcal{H}_k$ are the characteristic hypergraphs $H(\mathcal{H}, \alpha)$.
Then GHDs $G_0, G_1, \ldots, G_k$ of $\mathcal{H}_0, \ldots, \mathcal{H}_k$ can be connected to form
a decomposable GHD $G$ for $Q_{\mathcal{H},\alpha}$. Conversely, any decomposable GHD $G$ of
$Q_{\mathcal{H},\alpha}$ can be partitioned into GHDs $G_0, G_1, \ldots, G_k$ of the characteristic
hypergraphs $\mathcal{H}_0, \ldots, \mathcal{H}_k$. Moreover, in both of these cases,

$\gamma\text{-width}(G) = \max_i \gamma\text{-width}(G_i)$.

The proof is provided in the appendix, but it is a straightforward application of definitions.

► Corollary 32. Given an optimal GHD for each characteristic hypergraph of an Ajar query
$Q_{\mathcal{H},\alpha}$, we can construct an optimal valid GHD. The width of the optimal valid GHD
equals the maximum optimal-GHD-width over its characteristic hypergraphs.

This reduces the problem of finding the optimal valid GHD to smaller problems of finding
optimal GHDs. We first present the decomposition in the FAQ paper. Then we present
several applications of our decomposition, and compare them to their FAQ analogues.

FAQ’s Decomposition 

The FAQ paper uses a decomposition of the problem that is not width-preserving. They remove
the set of output attributes $V(-\alpha)$ and decompose the rest of the hypergraph into smaller
hypergraphs. They construct a regular Variable-Ordering/GHD for each hypergraph. Then they
add all output attributes $V(-\alpha)$ into each bag of each of the GHDs, and then stitch the GHDs
together. This output addition to the bags of the GHDs leads to a potentially 2x increase in width
compared to our method, which stitches the GHDs together without changing their width. As a
result, FAQ's decomposition incurs higher runtime costs in each application of the decomposition,
as we see in the next three subsections.

► Example 33. Consider a query with output attribute A

$\sum_{(B,+)} \sum_{(C,+)} (R(A, B) \bowtie S(B, C))$.

The optimal valid GHD for this query has bags {A, B} and {B, C}, and thus has fhw 1. The faqw
is also 1. If we apply our decomposition, we get a GHD with bags {A}, {A, B}, {B, C} which still
has fhw 1. FAQ's decomposition on the reduced hypergraph (with output attribute A removed)
has one bag {B, C}. Adding A to it gives a single bag {A, B, C} resulting in a fhw of 2. More
generally, consider query $Q_n$ with $\alpha = ((B_1, +), (B_2, +), \ldots, (B_n, +))$ and relations
$T(A_1, B_1)$ and also $R_{ij}(A_i, A_j)$, $S_{ij}(B_i, B_j)$ for $i, j \in \{1, 2, \ldots, n\}$.
Our decomposition gives a GHD with bags $\{A_1, A_2, \ldots, A_n\}$, $\{A_1, B_1\}$,
$\{B_1, B_2, \ldots, B_n\}$, which has fhw $n/2$. FAQ's decomposition has a
single bag and fhw equal to $n$.

5.2 Finding optimal valid GHDs 

Armed with Corollary 32, we simplify the brute force search algorithm for finding optimal valid
GHDs.

► Theorem 34. Let $Q_{\mathcal{H},\alpha}$ be an Ajar query. The optimal width valid GHD for this query can
be found in time 

This runtime for finding the optimal valid GHD can be exponentially better than the naive
runtime:

► Example 35. Consider the star query
$\mathcal{H} = (\{A, B_1, \ldots, B_n\}, \{\{A, B_i\} \mid 1 \le i \le n\})$,
$\alpha = (B_1, +), (B_2, +), \ldots, (B_n, +)$. A is the only output attribute. Removing A breaks
the hypergraph into $n$ components, so there are $n + 1$ characteristic hypergraphs, each of size
$\le 2$. Finding the optimal valid GHD takes time $O(n)$, whereas the standard algorithm takes time
exponential in $n$.

We can also approximate the GHD [16]:

► Theorem 36 (Marx's GHD approximation). Let $Q$ be a join query with hypergraph $\mathcal{H}$ and
fractional hypertree width $w$. Then we can find a GHD for $Q$ in time polynomial in
$|\mathcal{H}|$, that has width $w' \le w^3$.


We can replicate Marx’s result for valid GHDs. 


► Theorem 37. Let $Q_{\mathcal{H},\alpha}$ be an Ajar query, such that its minimum width valid GHD
has width $w$. Then we can find a valid GHD in time polynomial in $|\mathcal{H}|$ that has width
$w' \le w^3$.


FAQ [15]'s decomposition lets them apply Marx's approximation as well. However, their
decomposition is not width-preserving, i.e. the width of their final GHD is higher than the width
of the GHDs they construct for the hypergraphs in the decomposition. Thus their decomposition
gives a weaker width guarantee of $faqw^3 + faqw$ [15, Theorem 9.49]. The extra $+faqw$ term
is due to output addition. Our guarantee, $w^3$, is strictly smaller ($w$ is the width of the
optimal valid GHD) as $w \le faqw$ by Theorem 12.


5.3 Tighter Runtime Exponents 

Marx introduced the notion of submodular width (sw) that is tighter than fhw, and showed
that a join query can be answered in time $\mathrm{IN}^{O(sw)}$. The $O$ in the exponent is because
Marx's algorithm requires an expensive preprocessing step. After the pre-processing,
the join can be performed in time $\mathrm{IN}^{sw}$. Despite the $O$ in the exponent, this
algorithm can be very valuable because there are families of hypergraphs that have unbounded fhw
but bounded sw. We can apply Marx's algorithm to the characteristic hypergraphs, potentially
improving our runtime. Marx also showed that joins on a family of hypergraphs are fixed parameter
tractable if and only if the submodular width of the hypergraph family is bounded. Moreover,
adaptive width (applicable only when relations are expressed as truth tables) is unbounded for a
hypergraph family if and only if submodular width is unbounded. Corollary 32 gets us an
analogous tractability result for Ajar queries.

► Theorem 38. We can answer an Ajar query Qn.a in time _|_ 

OUT). 

Recent work uses degree information to more tightly bound the output size of a query.

The bound in the reference, called the DBP bound, has a tighter exponent than the AGM bound,
while requiring only linear preprocessing to obtain. The authors also provide algorithms whose
runtime matches the DBP bound. We can define DBP-width $dbpw(\mathcal{T}, \mathcal{H})$ such that
$\mathrm{IN}^{dbpw(\mathcal{T},\mathcal{H})}$ is the maximum value of the DBP bound over all bags
of GHD $\mathcal{T}$. We then use the improved algorithm in place of GJ in AggroGHDJoin. This lets
us get tighter results "for free", reducing our runtime to $\mathrm{IN}^{dbpw}$ instead of
$\mathrm{IN}^{fhw}$. Formally:

► Theorem 39. Given an Ajar query $Q_{\mathcal{H},\alpha}$ and a valid GHD for $\mathcal{H}$, we can answer the query in

time 0(IN'^^*’™^^’^^-|-0UT). Equivalently, we can answer the query in time dbpw(T,'H)_^ 

OUT). 

As discussed before, FAQ has a non-width-preserving decomposition. We can combine FAQ's
decomposition with the DBP bound as we did above. Suppose we perform FAQ's decomposition,
and $\mathrm{IN}^{faqw^+}$ denotes the highest value of the DBP bound on each of their
characteristic hypergraphs, and on the set of output attributes. Thus the DBP-width of each of
their characteristic hypergraphs, and the outputs, is $faqw^+$. However, when they perform output
addition, the DBP-width of the resulting GHDs can go up to $2 faqw^+$. This happens when the DBP
bound on both the outputs and one of the characteristic hypergraphs equals $\mathrm{IN}^{faqw^+}$.
So if we apply the DBP result to FAQ's decomposition, we get a runtime of
$\tilde{O}(\mathrm{IN}^{2 faqw^+} + \mathrm{OUT})$. Thus their decomposition causes them to incur
an extra factor of 2 in the exponent. They similarly incur a factor of 2 increase in exponent for
the submodular width algorithm.


5.4 MapReduce and Parallel Processing 

The GYM algorithm uses GHDs to efficiently process joins in a MapReduce setting. GYM
makes use of the GHD structure to parallelize different parts of the join. Given a GHD of depth
$d$ and width $w$, with $n$ attributes, GYM can perform a join in a MapReduce setting in
$O(d + \log(n))$ rounds at a communication cost of $M^{-1}(\mathrm{IN}^{w} + \mathrm{OUT})^2$,
where $M$ is the memory per processor on the MapReduce cluster. Combining this with the
degree-based MapReduce algorithm gives us the following result:

► Theorem 40. Given an optimal valid GHD $(\mathcal{T}^*, \chi)$ of depth $d$, and DBP-width dbpw,
we can answer an Ajar query with Communication Cost equal to
$O(M^{-1}(\mathrm{IN}^{dbpw} + \mathrm{OUT})^2)$
in $d + \log(n)$ MapReduce rounds, where $n$ is the number of attributes and $M$ is the available
memory per processor.


A GHD can have depth up to $O(n)$, in which case the algorithm can take a very large number
of MapReduce rounds ($O(n)$). To address this, the GYM paper uses the 'Log-GTA' algorithm
to reduce the depth of any given GHD to $\log(n)$ while at most tripling its width. This lets it
process joins in $\log(n)$ MapReduce rounds at a cost of
$M^{-1}(\mathrm{IN}^{3w} + \mathrm{OUT})^2$.

Log-GTA involves some shuffling of the attributes in the GHD bags, so naively applying
it to a valid GHD could make the GHD invalid (see example 61 in the Appendix). But our
decomposition lets us apply Log-GTA to the GHD of each characteristic hypergraph, and then
stitch the short GHDs together. Our decomposition is recursive in nature; let $d'$ be the maximum
recursive depth of the decomposition for a given $Q_{\mathcal{H},\alpha}$. Then the depth of the
shortened GHD of each characteristic hypergraph is $O(\log(n))$, and so the depth of the valid GHD
obtained by stitching them together is $O(d' \log(n))$. This gives us the result:

► Theorem 41. If dbpw is the DBP width of an Ajar query, we can answer that query with
Communication Cost equal to $O(M^{-1}(\mathrm{IN}^{3 \cdot dbpw} + \mathrm{OUT})^2)$ in
$d' \log(n)$ MapReduce rounds, where $n$ is the number of attributes and $M$ is the available
memory per processor.


$d'$ can vary from $O(1)$ to $O(n)$ depending on the query. The star query from Example 35 has
$d' = 2$, which lets us process it in $\log(n)$ MapReduce rounds. Any query that only has a single
type of aggregation will have $d' = 2$ as well. On the other hand, a query with one relation having
$n$ attributes, 1 output attribute, and alternating $\sum$ and max aggregations, will have
$d' = n$, and will be hard to parallelize.

Olteanu and Zavodny use valid GHDs to answer Ajar queries for the special case
of a single type of aggregation. But they have no notion of a decomposition, and attempting
to shorten a valid GHD directly, without using a decomposition, may make it invalid. FAQ's
decomposition may be used to shorten GHDs similarly to ours, but leads to an increased width
of $4 faqw$ compared to our $3w$ (where $w \le faqw$ is the width of our optimal valid GHD). This
is again because of output addition: if the output attributes have a width of $faqw$, and the
shortened GHDs of the characteristic hypergraphs have a width of $3 faqw$, then the total width
will be $4 faqw$.


6 Product Aggregations

The primary application of queries with multiple aggregations is to establish bounds for the
Quantified Conjunctive Query (QCQ) problem, and its counting variant, #QCQ. We now
introduce a new type of aggregation, called product aggregation, that lets us efficiently handle
QCQ queries. We define the Ajar problem for product aggregations, and then extend our
algorithm from Section 3.3 to handle this new type of Ajar query. We then define a decomposition
analogous to that in Section 5. A more detailed version of this section with additional motivation,
examples, and proofs can be found in Appendix E.

6.1 AJAR queries with product aggregates 

A product aggregation aggregates using the $\otimes$ operator. Throughout the paper, we assumed
that an absent tuple effectively has an annotation of 0. Taking this into account, we formally
define the product aggregation. Let $B = F \setminus A$:

► Definition 42. $\sum_{(A,\otimes)} R_{AB} = \{(t_B, \lambda) : \forall t_A \in D^A,\ t_A \circ t_B \in R_{AB} \text{ and } \lambda = \bigotimes_{t_A \in D^A} \lambda_{t_A \circ t_B}\}$

We can adjust the definition of aggregation orderings and Ajar queries to allow this new
type of aggregation. QCQ queries can now be expressed as Ajar queries on the
$(\{0, 1\}, \max, \cdot)$ semiring. We assume for this section that $\otimes$ is idempotent, i.e.
$a \otimes a = a$ for all $a$. We describe how to work with non-idempotent products in Appendix E.4.

6.2 Algorithms for product aggregates 

For aggregation orderings that have product aggregations, the rules for determining when two
orderings are equivalent are somewhat different; product aggregations can be performed before a
join. We illustrate this with an example:

► Example 43. In the semiring $(\{0, 1\}, \max, \cdot)$, suppose we have two relations
$R(A, B) = \{((0,0), x), ((0,1), y)\}$ and $S(B, C) = \{((0,1), p), ((1,1), q)\}$. Consider the
Ajar query $\sum_{(B,\otimes)} R(A, B) \bowtie S(B, C)$. If we compute the join, we will get two
tuples with the annotations $x \cdot p$ and $y \cdot q$, and then aggregating over $B$ will produce
a relation with the element $((0,1), x \cdot p \cdot y \cdot q)$. However, note that
$x \cdot p \cdot y \cdot q = (x \cdot y) \cdot (p \cdot q)$, implying that
$\sum_{(B,\otimes)} (R(A, B) \bowtie S(B, C)) = (\sum_{(B,\otimes)} R(A, B)) \bowtie (\sum_{(B,\otimes)} S(B, C))$.
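The identity in Example 43 can be checked mechanically. The following is a minimal illustrative sketch, not part of the paper's algorithms: it encodes annotated relations as Python dictionaries from tuples to annotations, assumes the domain of B is {0, 1}, treats absent tuples as having annotation 0, and drops tuples whose annotation is 0. All helper names (`join_on_b`, `product_aggregate`, `cross`) are hypothetical.

```python
from itertools import product

DOMAIN_B = (0, 1)  # assumed domain of attribute B


def join_on_b(r, s):
    """Join R(A, B) with S(B, C) on B; annotations are multiplied."""
    return {(a, b, c): x * y
            for (a, b), x in r.items()
            for (b2, c), y in s.items() if b == b2}


def product_aggregate(rel, pos, domain):
    """Product-aggregate the attribute at index `pos`.

    A tuple that is absent from `rel` counts as annotation 0.
    """
    keys = {t[:pos] + t[pos + 1:] for t in rel}
    out = {}
    for key in keys:
        ann = 1
        for d in domain:
            ann *= rel.get(key[:pos] + (d,) + key[pos:], 0)
        if ann != 0:  # zero-annotated tuples are treated as absent
            out[key] = ann
    return out


def cross(r, s):
    """Join of two relations with no shared attributes."""
    return {ta + tc: x * y
            for ta, x in r.items()
            for tc, y in s.items() if x * y != 0}


def aggregate_after_join(x, y, p, q):
    r = {(0, 0): x, (0, 1): y}
    s = {(0, 1): p, (1, 1): q}
    return product_aggregate(join_on_b(r, s), 1, DOMAIN_B)


def aggregate_before_join(x, y, p, q):
    r = {(0, 0): x, (0, 1): y}
    s = {(0, 1): p, (1, 1): q}
    return cross(product_aggregate(r, 1, DOMAIN_B),
                 product_aggregate(s, 0, DOMAIN_B))


# Compare both evaluation orders for every 0/1 choice of x, y, p, q.
agree = all(aggregate_after_join(*v) == aggregate_before_join(*v)
            for v in product((0, 1), repeat=4))
```

Both evaluation orders agree on every assignment of annotations in {0, 1}, which is the commutation the example relies on.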

Now we describe our algorithm for solving Ajar queries when product aggregations are
present. Our algorithm follows the same lines as the algorithm from Section 3.3. Recall that
the algorithm consisted of searching for equivalent orderings, then searching for GHDs compatible
with an equivalent ordering, and running AggroGHDJoin on the GHD with the smallest fhw. For
product aggregations, we need to modify our algorithm for testing equivalent orderings, and our
definition of compatibility; we do these in turn.

Testing orderings for equivalence 

We describe how we modify the equivalence-testing algorithm when product aggregations are
present. Let $PA(\alpha)$ denote the set of product attributes in ordering $\alpha$. We make two
changes to the algorithm:

(1) Instead of removing $V(-\alpha)$ and dividing $\mathcal{H}$ into components, we remove
$V(-\alpha) \cup PA(\alpha)$ and then divide $\mathcal{H}$ into components $C_1, C_2, \ldots, C_m$.
Then for each $C_i$ we define $C'_i = C_i \cup (PA(\alpha) \cap \bigcup_{E \in \mathcal{E}_{C_i}} E)$.
(Recall that for any $R \subseteq \mathcal{V}$, $\mathcal{E}_R$ is defined to be
$\{E \in \mathcal{E} \mid E \cap R \ne \emptyset\}$.) That is, $C'_i$ has the attributes of $C_i$ as
well as the product attributes that are in the same hyperedge as some attribute in $C_i$. Then we
recursively call the equivalence test on $(\alpha_{C'_i}, \beta_{C'_i})$ instead of on
$(\alpha_{C_i}, \beta_{C_i})$.

(2) When we are checking for an $i < j$ such that $\oplus'_i \ne \oplus'_j$ and there is a
path in $\{b_i, b_{i+1}, \ldots, b_{|\alpha|}\}$, we instead check for a path in

$(\{b_i, b_{i+1}, \ldots, b_{|\alpha|}\} \setminus PA(\alpha)) \cup \{b_i, b_j\}$.

That is, we look for a $b_i$ that has a different operator than $b_j$, and has a path to $b_j$
consisting only of $b_i$, $b_j$, and semiring attributes in $\{b_{i+1}, \ldots, b_{|\alpha|}\}$.
Appendix E gives the pseudo-code for the modified algorithm and proves that it is sound and
complete.

► Lemma 44. The above algorithm returns True if and only if $\alpha \equiv_{\mathcal{H}} \beta$.
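The first change (removing output and product attributes, splitting into components, then re-attaching the product attributes that share a hyperedge with each component) can be sketched as follows. This is an illustrative sketch under assumptions rather than the paper's pseudo-code: hyperedges are modeled as frozensets of attribute names, and the function name `extended_components` is hypothetical.

```python
def extended_components(edges, output_attrs, product_attrs):
    """Components of the hypergraph after removing output and product
    attributes, each extended with the product attributes that share a
    hyperedge with it."""
    edges = [frozenset(e) for e in edges]
    product_attrs = set(product_attrs)
    vertices = set().union(*edges)
    remaining = vertices - set(output_attrs) - product_attrs

    unvisited = set(remaining)
    result = set()
    while unvisited:
        start = unvisited.pop()
        comp = {start}
        frontier = [start]
        while frontier:  # flood fill through hyperedges
            v = frontier.pop()
            for e in edges:
                if v in e:
                    for u in (e & remaining) - comp:
                        comp.add(u)
                        frontier.append(u)
        unvisited -= comp
        # Hyperedges that intersect the component.
        touching = set().union(*[e for e in edges if e & comp])
        result.add(frozenset(comp | (touching & product_attrs)))
    return result


# Path A - B - P - C where P is a product attribute.
example = extended_components(
    [{"A", "B"}, {"B", "P"}, {"P", "C"}],
    output_attrs=set(),
    product_attrs={"P"},
)
```

In this small example, removing P splits the hypergraph into the components {A, B} and {C}, and P is then added back to both because it shares a hyperedge with each.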

Compatible GHDs 

Product aggregates not only change the set of equivalent orderings, but also the set of GHDs
compatible with a given ordering. In fact, product aggregates allow us to break the rules of
GHDs without causing incorrect behavior. We express this using a simple variant of GHDs,
called aggregating generalized hypertree decompositions (AGHDs). Informally, AGHDs are GHDs
that can violate the running intersection property for attributes that have a product aggregation.
AGHDs are formally defined in Appendix E. We determine compatibility for AGHDs as follows:
An AGHD is compatible with an ordering $\beta$ if for every attribute pair $a, b$ such that one of
the $TOP(a)$ nodes is an ancestor of a $TOP(b)$ node, $a$ precedes $b$ in $\beta$.

We can now modify our algorithm from Section 3.3 to detect equivalent orderings using the
modified equivalence test, then search for compatible AGHDs, and run AggroGHDJoin over the
compatible AGHD with the smallest fhw. Our runtime is given by the next theorem.

► Theorem 45. Given an Ajar query $Q_{\mathcal{H},\alpha}$ possibly involving idempotent product
aggregates, let $w^*$ be the smallest fhw for an AGHD compatible with an ordering equivalent to
$\alpha$. Then the runtime for our algorithm is $\tilde{O}(\mathrm{IN}^{w^*} + \mathrm{OUT})$.

Decomposing AGHDs 

We can apply the ideas from Section 5 to Ajar queries with product aggregates as well. We
can define a notion of decomposable AGHDs for queries with product aggregates, and show the
following results:

► Theorem 46. All decomposable AGHDs are compatible with an ordering $\beta$ such that
$\beta \equiv_{\mathcal{H}} \alpha$.

► Theorem 47. For every valid AGHD $(\mathcal{T}, \chi)$, there exists a decomposable
$(\mathcal{T}', \chi')$ such that for all node-monotone functions $\gamma$, the $\gamma$-width of
$(\mathcal{T}', \chi')$ is no larger than the $\gamma$-width of $(\mathcal{T}, \chi)$.

We can define characteristic hypergraphs similarly to how we did in Section 5 (see Appendix E
for a formal definition). We have the following result:

► Theorem 48. For an Ajar query $Q_{\mathcal{H},\alpha}$ involving product aggregates, suppose
$H_0, \ldots, H_k$ are the characteristic hypergraphs $H(\mathcal{H}, \alpha)$. Then GHDs
$G_0, G_1, \ldots, G_k$ of $H_0, \ldots, H_k$ can be connected to form a decomposable AGHD $G$ for
$Q_{\mathcal{H},\alpha}$. Conversely, any decomposable AGHD $G$ of $Q_{\mathcal{H},\alpha}$ can be
partitioned into GHDs $G_0, G_1, \ldots, G_k$ of the characteristic hypergraphs $H_0, \ldots, H_k$.
Moreover, in both of these cases, $\gamma$-width$(G) = \max_i \gamma$-width$(G_i)$.

These theorems let us apply all the optimizations from Sections 5.2, 5.3, and 5.4 to Ajar
queries with product aggregates.

Comparison to FAQ 

The runtime of InsideOut on a query involving idempotent product aggregations is given by
$\tilde{O}(\mathrm{IN}^{faqw})$, where the faqw depends on the ordering and the presence of
product aggregations. Our algorithm for handling product aggregations recovers the runtime of
FAQ. Formally,

► Theorem 49. For any Ajar query involving idempotent product aggregations,
$\mathrm{IN}^{w^*} + \mathrm{OUT} \le 2 \cdot \mathrm{IN}^{faqw}$.

The proof is in Appendix B.1. By applying ideas from the FAQ paper to our setting, we can
also recover the FAQ runtime on #QCQ (Appendix E.3). Our algorithm for detecting equivalence
of orderings is both sound and complete; in contrast, FAQ's equivalence testing algorithm is
sound but not complete. Moreover, we have a width-preserving decomposition for queries with
product aggregates. This allows us to get tighter runtime exponents in terms of submodular and
DBP-widths (Theorems 38 and 39) and efficient MapReduce algorithms (Theorems 40 and 41).


7 Conclusion

We investigate solutions to and the structure of Ajar queries: aggregate-join queries with
multiple aggregators over annotated relations. We start by providing a very simple algorithm based
on a variant of the standard GHDJoin algorithm that generates query plans by relying on a
simple test of equivalence between aggregation orderings. This naive approach is sufficient to
recover and surpass the runtime of state-of-the-art solutions. We proceed to investigate more
interesting technical questions regarding the structure of Ajar queries. We first develop a partial
ordering that fully characterizes equivalent orderings. We then develop a characterization of the
corresponding valid GHDs, describing how they can be decomposed into ordinary, unrestricted
GHDs. This reduction connects us to a trove of parallel work on GHDs. We finish by extending
our work to handle product aggregations.

Algorithm 3 GHDJoin($\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, $(\mathcal{T}(\mathcal{V}_{\mathcal{T}}, \mathcal{E}_{\mathcal{T}}), \chi)$, $\{R_F \mid F \in \mathcal{E}_{\mathcal{H}}\}$)
Input: Query hypergraph $\mathcal{H}$, GHD $(\mathcal{T}, \chi)$, Relations $R_F$ for each $F \in \mathcal{E}_{\mathcal{H}}$
$S_{\mathcal{T}} \leftarrow \emptyset$
for all $t \in \mathcal{V}_{\mathcal{T}}$ do
    $\mathcal{H}_t \leftarrow (\chi(t), \{\pi_{\chi(t)} F \mid F \in \mathcal{E}_{\mathcal{H}}\})$
    $S_{\mathcal{T}} \leftarrow S_{\mathcal{T}} \cup GJ(\mathcal{H}_t, \{\pi_{\chi(t)} R_F \mid F \in \mathcal{E}_{\mathcal{H}}\})$
end for
return Yannakakis($\mathcal{T}$, $S_{\mathcal{T}}$)


GenericJoin 

We first describe the AGM bound on the join output size developed by Atserias, Grohe, and
Marx. Given query hypergraph $\mathcal{H}_Q = (\mathcal{V}, \mathcal{E})$ and relations
$\{R_F \mid F \in \mathcal{E}\}$, consider the following linear program:

Minimize $\sum_{F \in \mathcal{E}} x_F \log_{\mathrm{IN}}(|R_F|)$

subject to

$\forall v \in \mathcal{V} : \sum_{F : v \in F} x_F \ge 1$

$\forall F \in \mathcal{E} : x_F \ge 0$

Any feasible solution $\vec{x}$ is a fractional edge cover. Suppose $\rho^*$ is the optimal
objective. Then the AGM bound on the worst-case output size of the join
$\bowtie_{F \in \mathcal{E}} R_F$ is given by
$\mathrm{IN}^{\rho^*} = \prod_{F \in \mathcal{E}} |R_F|^{x_F}$.
We will use $AGM(Q)$ to denote the AGM bound on a query $Q$. The GenericJoin (GJ)
algorithm [20] computes a join in time $\tilde{O}(AGM(Q))$ for any join query. GJ will be used as a
subroutine in a later algorithm, where $GJ(\mathcal{H}, \{R_F \mid F \in \mathcal{E}_{\mathcal{H}}\})$
denotes a call to GenericJoin with one input relation $R_F$ per hyperedge $F$ in hypergraph
$\mathcal{H}$.
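As a small worked illustration of the bound, the sketch below checks whether a given weight vector is a fractional edge cover and evaluates the product of relation sizes raised to those weights. It does not solve the linear program; it only evaluates a cover supplied by the caller. The function names are hypothetical, and the triangle query with weights (1/2, 1/2, 1/2) is used because that cover is the well-known optimum for the triangle.

```python
import math


def is_fractional_edge_cover(edges, x, tol=1e-9):
    """Check x_F >= 0 for every edge and sum_{F containing v} x_F >= 1 for every vertex."""
    vertices = set().union(*edges)
    nonnegative = all(w >= -tol for w in x)
    covered = all(
        sum(w for e, w in zip(edges, x) if v in e) >= 1 - tol
        for v in vertices
    )
    return nonnegative and covered


def agm_bound(sizes, x):
    """Evaluate prod_F |R_F|^{x_F} for a given cover x."""
    return math.prod(n ** w for n, w in zip(sizes, x))


# Triangle query R(A,B), S(B,C), T(A,C), each relation of size 100.
triangle = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
half_cover = [0.5, 0.5, 0.5]
bound = agm_bound([100, 100, 100], half_cover)
```

With three relations of size 100, the half-weight cover gives a bound of 100^{3/2} = 1000, whereas the integral cover (1, 1, 0) gives the weaker bound 100^2.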


Yannakakis 

Yannakakis' algorithm operates on $\alpha$-acyclic queries. There are several different equivalent
definitions of $\alpha$-acyclicity; we provide the definition that builds a tree out of the
relations as it most naturally relates to generalized hypertree decompositions.

► Definition 50. Given a hypergraph $\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$,
a join tree over $\mathcal{H}$ is a tree $\mathcal{T} = (\mathcal{V}_{\mathcal{T}}, \mathcal{E}_{\mathcal{T}})$
with $\mathcal{V}_{\mathcal{T}} = \mathcal{E}_{\mathcal{H}}$ such that for every attribute
$A \in \mathcal{V}_{\mathcal{H}}$, the set $\{v \in \mathcal{V}_{\mathcal{T}} \mid A \in v\}$ forms a
connected subtree in $\mathcal{T}$.

A hypergraph $\mathcal{H}$ is $\alpha$-acyclic if there exists a join tree over $\mathcal{H}$. We
can use the classic GYO algorithm to produce a join tree [ch. 6]. The Yannakakis algorithm takes a
join tree as input. Its pseudo-code is given in Section 3.1.

► Theorem 51. The Yannakakis algorithm runs in $O(\mathrm{IN} + \mathrm{OUT})$, where IN and OUT
are the sizes of the input and output, respectively.

To leverage the speed of Yannakakis for cyclic queries, we look to GHDs. The intuition
behind a GHD is to group the attributes into bags (as specified by the function $\chi$) such that
we can build a join tree over these bags. This allows us to run GJ within each bag and then
Yannakakis on the join tree. The resulting algorithm is GHDJoin, whose pseudo-code is given in
Algorithm 3. The runtime of GHDJoin is given by the following theorem.

► Theorem 52. GHDJoin runs in $\tilde{O}(\mathrm{IN}^{fhw} + \mathrm{OUT})$.

We can make some straightforward modifications to the above join algorithms to perform
aggregations. The traditional Yannakakis and GHDJoin algorithms perform the join in a bottom-up
fashion, after a semijoin phase to ensure that there are no dangling tuples. The modified
algorithms handle aggregations using the same intuition as in traditional query plans:
"push down" aggregations as far as possible. Since each attribute must occur in a connected
subtree of the GHD, we can push its aggregation down to the root of this connected subtree,
which is the TOP node of the attribute. There is a standard modification to Yannakakis for
project-join queries that projects away attributes at their TOP node. Instead of projecting,
we perform aggregation.

We provide the pseudo-code of AggroYannakakis, which is a simple variant of the well-known
Yannakakis algorithm, in Algorithm 4. Algorithm 5 gives the pseudo-code of AggroGHDJoin,
which is a variant of GHDJoin that calls AggroYannakakis instead of Yannakakis. AggroGHDJoin
also does some extra work to ensure we pass each annotation to GJ only once. The operator
$\pi^1$ in AggroGHDJoin denotes a projection that projects tuples while replacing the annotation
by 1, to ensure that the same annotation isn't counted more than once.


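Algorithm 4 AggroYannakakis($\mathcal{T} = (\mathcal{V}, \mathcal{E})$, $\alpha$, $\{R_F \mid F \in \mathcal{V}\}$)
Input: Join tree $\mathcal{T} = (\mathcal{V}, \mathcal{E})$, Aggregation order $\alpha$, Relations $R_F$ for each $F \in \mathcal{V}$
for all $F \in \mathcal{V}$ in some bottom-up order do    ▷ Semi-join reduction up
    $P \leftarrow$ parent of $F$
    $R_P \leftarrow R_P \ltimes R_F$
end for
for all $F \in \mathcal{V}$ in some top-down order do    ▷ Semi-join reduction down
    $P \leftarrow$ parent of $F$
    $R_F \leftarrow R_F \ltimes R_P$
end for
for all $F \in \mathcal{V}$ in some bottom-up order do    ▷ Aggregation
    $\beta \leftarrow \alpha \cap \{a \in \mathcal{V}(\mathcal{H}) \mid TOP_{\mathcal{T}}(a) = F\}$
    $R' \leftarrow \sum_{\beta} R_F$
    if $F$ is not the root then
        $P \leftarrow$ parent of $F$
        $R_P \leftarrow R_P \bowtie R'$    ▷ Compute the join
    end if
end for
return $R'$ computed at the root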
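The aggregation phase of AggroYannakakis can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes a single aggregation operator (ordinary + over integer annotations, so the order within each node's aggregated set does not matter), omits the two semi-join passes (which do not change the output), and represents a relation as a pair `(attribute_names, {tuple: annotation})`. The helper names and the `top`/`parent` dictionaries are hypothetical.

```python
from collections import defaultdict


def join(r, s):
    """Natural join of two annotated relations; annotations multiply."""
    (r_attrs, r_rows), (s_attrs, s_rows) = r, s
    shared = [a for a in r_attrs if a in s_attrs]
    extra = [a for a in s_attrs if a not in r_attrs]
    out = {}
    for t1, x in r_rows.items():
        row1 = dict(zip(r_attrs, t1))
        for t2, y in s_rows.items():
            row2 = dict(zip(s_attrs, t2))
            if all(row1[a] == row2[a] for a in shared):
                out[t1 + tuple(row2[a] for a in extra)] = x * y
    return (tuple(r_attrs) + tuple(extra), out)


def aggregate(r, attrs):
    """Sum out the attributes in `attrs`."""
    r_attrs, r_rows = r
    keep = tuple(a for a in r_attrs if a not in attrs)
    out = defaultdict(int)
    for t, x in r_rows.items():
        row = dict(zip(r_attrs, t))
        out[tuple(row[a] for a in keep)] += x
    return (keep, dict(out))


def aggro_yannakakis(relations, parent, top, alpha):
    """`relations` must list nodes bottom-up, with the root last."""
    rel = dict(relations)
    for node in relations:
        beta = [a for a in alpha if top[a] == node]  # attributes whose TOP is this node
        aggregated = aggregate(rel[node], beta)
        if node not in parent:                       # the root has no parent
            return aggregated
        p = parent[node]
        rel[p] = join(rel[p], aggregated)            # push the result into the parent


# Path query: sum over B and C of R(A, B) joined with S(B, C); A is the output.
R = (("A", "B"), {(1, 1): 2, (1, 2): 3, (2, 1): 5})
S = (("B", "C"), {(1, 7): 1, (1, 8): 4, (2, 7): 6})
result = aggro_yannakakis({"S": S, "R": R}, parent={"S": "R"},
                          top={"B": "R", "C": "S"}, alpha=["B", "C"])
brute_force = aggregate(join(R, S), ["B", "C"])
```

Here C is aggregated at the leaf S before the join, and B at the root R, and the result matches aggregating after the full join.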


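Algorithm 5 AggroGHDJoin($\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, $(\mathcal{T}(\mathcal{V}_{\mathcal{T}}, \mathcal{E}_{\mathcal{T}}), \chi)$, $\{R_F \mid F \in \mathcal{E}_{\mathcal{H}}\}$)
Input: Query hypergraph $\mathcal{H}$, GHD $(\mathcal{T}, \chi)$, Relations $R_F$ for each $F \in \mathcal{E}_{\mathcal{H}}$
$S_{\mathcal{T}} \leftarrow \emptyset$
for all $t \in \mathcal{V}_{\mathcal{T}}$ do
    $\mathcal{H}_t \leftarrow (\chi(t), \{\pi_{\chi(t)} F \mid F \in \mathcal{E}_{\mathcal{H}}\})$
    $I \leftarrow \{R_F \mid F \subseteq \chi(t), \exists a \in F : TOP_{\mathcal{T}}(a) = t\} \cup \{\pi^1_{\chi(t)} R_F \mid F \not\subseteq \chi(t) \text{ or } \forall a \in F : TOP_{\mathcal{T}}(a) \ne t\}$
    $S_{\mathcal{T}} \leftarrow S_{\mathcal{T}} \cup GJ(\mathcal{H}_t, I)$
end for
return AggroYannakakis($\mathcal{T}$, $S_{\mathcal{T}}$)

In the classic analysis of Yannakakis, the runtime of the semi-join portion is bounded by
$O(\mathrm{IN})$ and the bottom-up join is bounded by $O(\mathrm{OUT})$. In AggroYannakakis, the
analysis of the semi-join portion is unchanged, but the aggregation reduces the size of the output,
thereby making the OUT bound harder to achieve. In particular, during the bottom-up join, we may
compute an intermediate relation whose attributes are not a subset of the output attributes,
meaning that its size may not be bounded by OUT. These potentially large intermediate relations
are the underlying cause for the traditional $\mathrm{IN} \cdot \mathrm{OUT}$ runtime.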

However, using intuition from prior work, if we require the output attributes to
appear above non-output attributes, we can preserve the $\mathrm{IN} + \mathrm{OUT}$ runtime.

► Theorem 53. Suppose we have a GHD such that for any output attribute $A$ and non-output
attribute $B$, $TOP_{\mathcal{T}}(B)$ is not an ancestor of $TOP_{\mathcal{T}}(A)$. AggroGHDJoin
runs in $\tilde{O}(\mathrm{IN}^{fhw} + \mathrm{OUT})$ given this GHD.

Proof. GJ on each bag still runs in $\tilde{O}(\mathrm{IN}^{fhw})$. We need to prove the
Yannakakis portion runs in $O(\mathrm{IN} + \mathrm{OUT})$ after running GJ.

The semijoin portion runs in $O(\mathrm{IN})$ as in the original Yannakakis algorithm. In the join
phase, we have two types of joins. In the first type, $F \setminus \beta \subseteq P$. This implies
the join output is a subset of $R_P$ (with different annotations). So the total runtime of this
type of join is $O(\mathrm{IN})$. For the second type, $F \setminus \beta \not\subseteq P$. This
means some attribute in $(F \setminus \beta) \setminus P$ must be an output attribute, and all
attributes in $P$ must be output attributes as well (as their TOP value is an ancestor of $F$). So
the result of our join must be a subset of the output table; the total runtime of this type of join
is $O(\mathrm{OUT})$. Thus the total runtime of the algorithm is $O(\mathrm{IN} + \mathrm{OUT})$. ◄

Note that while AggroGHDJoin runs in $\tilde{O}(\mathrm{IN}^{fhw} + \mathrm{OUT})$ time on the
GHDs above, it may not necessarily produce the right output unless the GHD satisfies additional
conditions, to ensure that aggregations can be done in the proper order. In particular, recall our
definition: a GHD $(\mathcal{T}, \chi)$ is compatible with $\alpha$ if for all attribute pairs
$A, B$, $TOP_{\mathcal{T}}(A)$ being an ancestor of $TOP_{\mathcal{T}}(B)$ implies that either $A$
is an output variable or $A$ occurs before $B$ in $\alpha$.

► Theorem 54. If a GHD $(\mathcal{T}, \chi)$ is compatible with $\alpha$, then AggroGHDJoin given
$(\mathcal{T}, \chi)$ correctly computes $Q_{\mathcal{H},\alpha}$.

Proof. We first show that AggroYannakakis works as expected. We note that the semi-join
reduction does not change the output; it only speeds up the process. We therefore only consider
the bottom-up join. For each node $t$ in the join tree, let $R(t)$ be the relation associated with
that node before this loop (i.e. after the semi-join portion). Let $R'(t)$ be the final relation
associated with node $t$ when we are processing node $t$ (i.e. after the bottom-up join with $t$'s
descendants is done, and after the aggregation in $t$). Let $\mathcal{T}_t$ be the subtree that
includes $t$ and all of its descendants. Let $s(t)$ be the attributes aggregated at node $t$, i.e.
$\alpha \cap \{a \in \mathcal{V} \mid TOP_{\mathcal{T}}(a) = t\}$, and let
$s(\mathcal{T}_t) = \bigcup_{t' \in \mathcal{T}_t} s(t')$. For each non-leaf node $t$, let $c(t)$ be
the set of $t$'s children.

For each node $t$, we claim $R'(t) = \sum_{\alpha_{s(\mathcal{T}_t)}} \bowtie_{t' \in \mathcal{T}_t} R(t')$.
The proof is by induction on the tree. For each leaf $l$, $R'(l) = \sum_{\alpha_{s(l)}} R(l)$ by
definition.

For a non-leaf node $t$,

$R'(t) = \sum_{\alpha_{s(t)}} \left( R(t) \bowtie \left( \bowtie_{t_c \in c(t)} R'(t_c) \right) \right)$

$= \sum_{\alpha_{s(t)}} \left( R(t) \bowtie \left( \bowtie_{t_c \in c(t)} \sum_{\alpha_{s(\mathcal{T}_{t_c})}} \bowtie_{t' \in \mathcal{T}_{t_c}} R(t') \right) \right)$

$= \sum_{\alpha_{s(\mathcal{T}_t)}} \bowtie_{t' \in \mathcal{T}_t} R(t')$

The second step is due to the inductive hypothesis. The final step is simply "pulling out" the
aggregations from the sub-orderings one at a time; we can arbitrarily interleave the aggregation
orders $\alpha_{s(\mathcal{T}_{t_c})}$. We can simply interleave them to match
$\alpha_{\bigcup_{t_c \in c(t)} s(\mathcal{T}_{t_c})}$. Since the original GHD is compatible with
$\alpha$, we know the aggregations $\alpha_{s(t)}$ precede $\alpha_{s(\mathcal{T}_{t_c})}$ in
$\alpha$, implying that
$\sum_{\alpha_{s(t)}} \sum_{\alpha_{\bigcup_{t_c \in c(t)} s(\mathcal{T}_{t_c})}} = \sum_{\alpha_{s(\mathcal{T}_t)}}$.
Our output is $R'(t_r)$ where $t_r$ is the root node, which is
$\sum_{\alpha} \bowtie_{t \in \mathcal{T}} R(t)$, as desired.

Since AggroYannakakis works as expected, we simply need to ensure that the bags are computed
appropriately. Note the GHD ensures that for every relation $R_F$, there is a node $t$ such that
$F \subseteq \chi(t)$. This means that no tuple is lost; computing AggroYannakakis on the bags will
compute the correct tuples. To ensure it computes the correct annotations, we need to ensure every
annotation appears in the bags at most once; our algorithm places the annotation of a relation
$R_F$ in the top-most node that contains all of the attributes of $R_F$. ◄


Product Aggregations: When product aggregations are present in an Ajar query
$Q_{\mathcal{H},\alpha}$, we have a notion of product partition hypergraphs, AGHDs over product
partition hypergraphs, and a corresponding notion of AGHDs compatible with an ordering. We now
prove Theorem 45, which extends Theorems 53 and 54 to the case where product aggregations are
present.

A product partition hypergraph $P = (\mathcal{V}_P, \mathcal{E}_P)$ essentially creates multiple
renamed copies of each product attribute $a$ $(a_1, a_2, \ldots, a_{|P_a|})$, and assigns one of
the renamed copies to each relation containing $a$. An AGHD is essentially a GHD over $P$. Given
$P$ and $a \in \mathcal{V}_{\mathcal{H}}$, let $P(a)$ equal $\{a\}$ if $a$ is not a product
attribute, and $\{a_1, \ldots, a_{|P_a|}\}$ otherwise. Given $a' \in \mathcal{V}_P$, let
$P^{-1}(a')$ equal $a$ such that $a' \in P(a)$. Given an edge $F \in \mathcal{E}_P$, let
$P^{-1}(F)$ denote the edge $\{P^{-1}(a') \mid a' \in F\}$. We define a modified ordering
$\alpha^P$ over $\mathcal{V}_P$ that takes $\alpha$ and replaces each occurrence of
$(a, \otimes)$ with $(a_1, \otimes), (a_2, \otimes), \ldots, (a_{|P_a|}, \otimes)$ for each product
attribute $a$. For any $F \in \mathcal{E}_P$, we define the relation $R_F$ to be the same as
$R_{P^{-1}(F)}$ (but with the attribute names changed). This gives us the modified Ajar query
$Q_{P,\alpha^P} = \sum_{\alpha^P} \bowtie_{F \in \mathcal{E}_P} R_F$. Then we have,

► Lemma 55. Suppose $R'(A, C_1)$ is a copy of $R(A, C)$ with $C$ renamed to $C_1$, and
$S'(B, C_2)$ is a copy of $S(B, C)$ with $C$ renamed to $C_2$. Then

$\sum_{(C_1,\otimes)} \sum_{(C_2,\otimes)} R'(A, C_1) \bowtie S'(B, C_2) = \sum_{(C,\otimes)} R(A, C) \bowtie S(B, C)$

Proof. Suppose the annotations for the $C$ values in $R$ are $n_1, n_2, \ldots, n_k$ and in $S$ are
$m_1, m_2, \ldots, m_k$ (assume all annotations are present, i.e. absent tuples have a zero
annotation). Then the RHS is $\bigotimes_{i=1}^{k} n_i m_i$. The LHS will have
$\bigotimes_{i=1}^{k} \bigotimes_{j=1}^{k} n_i m_j$. The RHS is equal to the LHS because of
idempotence of $\otimes$. Note that if $\otimes$ wasn't idempotent, the LHS would have the $m_j$
terms multiplied $k$ times while the RHS has them once. ◄
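The role of idempotence in Lemma 55 can be seen numerically. The sketch below is illustrative only: it fixes one (A, B) pair, represents the annotations over the k values of C as two lists `n` and `m` (hypothetical names), and compares the two sides of the lemma. With annotations in {0, 1}, multiplication is idempotent and the sides agree; with general integers they differ because factors are repeated on the left-hand side.

```python
from itertools import product
from math import prod


def renamed_side(n, m):
    """LHS: C is split into C1 and C2, so every n_i meets every m_j."""
    return prod(ni * mj for ni in n for mj in m)


def original_side(n, m):
    """RHS: the join matches equal C values, so n_i only meets m_i."""
    return prod(ni * mi for ni, mi in zip(n, m))


# Idempotent case: annotations drawn from {0, 1}, k = 2.
idempotent_agree = all(
    renamed_side(n, m) == original_side(n, m)
    for n in product((0, 1), repeat=2)
    for m in product((0, 1), repeat=2)
)

# Non-idempotent case: ordinary integer multiplication.
lhs_int = renamed_side([2, 3], [5, 7])
rhs_int = original_side([2, 3], [5, 7])
```

In the integer case the left-hand side is 44100 while the right-hand side is 210, which is why the renaming argument requires an idempotent product.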


► Lemma 56. For each database instance $I$, $Q_{\mathcal{H},\alpha}(I) = Q_{P,\alpha^P}(I)$.

This lemma can be proved by repeated application of Lemma 55.

Now we can easily prove Theorem 45. Suppose we have an AGHD $D = (\mathcal{T}, \chi, P)$ which is
compatible with an ordering $\alpha$. Then the GHD $(\mathcal{T}, \chi)$ over hypergraph $P$ is
compatible with $\alpha^P$. Running AggroGHDJoin over this GHD, with ordering $\alpha^P$, correctly
computes $Q_{P,\alpha^P}(I)$, due to Theorem 54. And by Lemma 56 this also equals
$Q_{\mathcal{H},\alpha}(I)$, which is the output we want. Also, since the AGHD is compatible with
$\alpha$, the GHD must satisfy the condition of Theorem 53 and hence AggroGHDJoin runs on it in
time $\tilde{O}(\mathrm{IN}^{fhw} + \mathrm{OUT})$.

B Comparison with Related Work

B.1 Section 3


In Section 3 we define a simple approach to solving Ajar queries, and we claim that our runtime
guarantee of $\tilde{O}(\mathrm{IN}^{w^*} + \mathrm{OUT})$ is at most
$\tilde{O}(\mathrm{IN}^{faqw})$. We note that the faqw exponent is actually the optimum value of
$faqw(\alpha)$ over the equivalent orderings $\alpha$ they consider (we discuss the space of
orderings they consider in the next subsection). Our approach will recognize $\alpha$ as being
equivalent, and will search for the best compatible GHD for $\alpha$. We will show that there
exists a compatible AGHD $(\mathcal{T}, \chi, P)$ for every equivalent ordering $\alpha$ such that
$fhw(\mathcal{T}, \mathcal{H}) = faqw(\alpha)$ (as Example 58 shows, the compatible AGHD
$\mathcal{T}$ we construct may not be the optimal compatible GHD).

We start by briefly summarizing FAQ's algorithm, with the pseudo-code (written in the
notation of this paper) given in Algorithm 6. Let $\alpha$ be the ordering used for aggregation.
Let $n$ denote the total number of attributes $|\mathcal{V}_{\mathcal{H}}|$ and $f$ denote the
number of output attributes (thus $|\alpha| = n - f$). For notational convenience, we will be using
$\alpha[i]$ to denote both the attribute and the operator that make up the operator-attribute pair
in the ordering.


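Algorithm 6 InsideOut($\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, $\alpha$, $\{R_F \mid F \in \mathcal{E}_{\mathcal{H}}\}$)
Input: Hypergraph $\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, Aggregation ordering $\alpha$, Relations $R_F$ for each $F \in \mathcal{E}_{\mathcal{H}}$
$E_n \leftarrow \{R_F \mid F \in \mathcal{E}_{\mathcal{H}}\}$
for ($k = n$; $k > f$; $k{-}{-}$) do
    $\partial(k) \leftarrow \{R_F \in E_k \mid \alpha[k - f] \in F\}$
    if $\alpha[k - f]$ is not a product aggregation then
        $U_k \leftarrow \bigcup_{R_F \in \partial(k)} F$
        $E_{k-1} \leftarrow (E_k \setminus \partial(k)) \cup \{\sum_{\alpha[k-f]} \bowtie_{R \in \partial(k)} R\}$
    else
        $E_{k-1} \leftarrow (E_k \setminus \partial(k)) \cup \{\sum_{\alpha[k-f]} R \mid R \in \partial(k)\}$
    end if
end for
return $\bowtie_{R \in E_f} R$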

FAQ relies on a worst-case optimal algorithm to compute each of the joins, implying that in
the $\tilde{O}(\mathrm{IN}^{faqw})$ runtime guarantee, faqw is defined as the maximum AGM bound
placed on each of the computed joins. Define
$\rho^*_{\mathcal{H}} : 2^{\mathcal{V}_{\mathcal{H}}} \to \mathbb{R}$ to be a function that maps a
subset of the attributes to the AGM bound on the subset (i.e. the optimal value of the canonical
linear program). Then
$faqw = \max(\max_k \rho^*_{\mathcal{H}}(U_k), \rho^*_{\mathcal{H}}(V(-\alpha)))$.

We will build up the compatible AGHD $(\mathcal{T}, \chi, P)$ in rounds corresponding to each of
the $k$ values of InsideOut. We first describe how to construct $(\mathcal{T}, \chi)$, and later
describe how to obtain $P$. At the start of the round corresponding to a particular $k$, we will
have a forest of AGHDs, each of which will have a root mapped (by $\chi$) to the attribute sets of
$E_k$, and at the end of each round, the forest's roots will be mapped to the relations of
$E_{k-1}$.

For an attribute set $F$, let $t(F)$ represent the node such that $\chi(t(F)) = F$. We start by
creating the $|\mathcal{E}_{\mathcal{H}}|$ nodes $\{t(F) \mid F \in E_n\}$, which are simply nodes
mapped to the input relations. Then for each $k$ from $n$ to $f + 1$, let $T$ represent the set of
nodes $\{t(F) \mid F \in \partial(k)\}$; these are the nodes that will be processed (i.e. the nodes
for whom we will create parents). If $\alpha[k - f]$ is not a product aggregation, we create a node
$t(U_k)$ and set $parent(t) = t(U_k)$ for all $t \in T$. We then create a node
$t(U_k \setminus \{\alpha[k - f]\})$ and set it to be $parent(t(U_k))$. Note that this process has
transformed the set of the forest's roots by removing $T$ and adding
$t(U_k \setminus \{\alpha[k - f]\})$, mirroring the transformation between $E_k$ and $E_{k-1}$. If
$\alpha[k - f]$ is a product aggregation, then for each $F \in \partial(k)$, we create a node
$t(F \setminus \{\alpha[k - f]\})$ and set it to be $parent(t(F))$; in this case as well the set of
the forest's roots matches $E_{k-1}$.

At the end of this process, we will have a forest of AGHDs whose roots map to the relations
in $E_f$. To conclude our construction, we simply construct the node $t(V(-\alpha))$ and set it to
be $parent(t(F))$ for all $F \in E_f$. If there are no product aggregations, then
$(\mathcal{T}, \chi)$ forms a GHD.

$(\mathcal{T}, \chi)$ satisfies the running intersection property for all non-product attributes,
but a product attribute $a$ can be present in multiple disconnected parts of $\mathcal{T}$. We now
describe a product partition $P$ such that $(\mathcal{T}, \chi, P)$ forms an AGHD for the ordering
$\alpha$. Let $P_a$ denote the number of distinct connected components of $\mathcal{T}$ in which
$a$ is present. Then we create $P_a$ copies of $a$ $(a_1, a_2, \ldots, a_{|P_a|})$, and assign a
copy to each component in some order. For each $F \in \mathcal{E}_{\mathcal{H}}$ that contains $a$,
if the component that $t(F)$ belongs to is assigned $a_i$, then $P$ assigns $a_i$ to $F$. Then
$(\mathcal{T}, \chi, P)$ is an AGHD for $\alpha$.

The AGHD $(\mathcal{T}, \chi, P)$ as described is trivially compatible with $\alpha$ since we
construct $parent(TOP_{\mathcal{T}}(\alpha[k - f]))$ explicitly in round $k$; this ensures that
$TOP_{\mathcal{T}}(\alpha[i])$ cannot be an ancestor of $TOP_{\mathcal{T}}(\alpha[j])$ if $i > j$.

► Lemma 57. Define $\rho^*_{\mathcal{H}}$ to be a function that maps a set of attributes to the AGM
bound on the set (the optimal value of the canonical linear program). The AGHD
$(\mathcal{T}, \chi, P)$ constructed as described satisfies

$fhw(\mathcal{T}, \mathcal{H}) = \max(\rho^*_{\mathcal{H}}(V(-\alpha)), \max_k \rho^*_{\mathcal{H}}(U_k)) = faqw(\alpha)$.

Proof. The nodes in our tree that do not map to $V(-\alpha)$ or the $U_k$ either map to an input
relation or to a relation created by aggregating an attribute from a single child node. In the
former case, $\rho^*_{\mathcal{H}}$ would evaluate to 1, so we can ignore them in our maximum. In
the latter case, the attributes are a strict subset of its child's attributes, implying we can
ignore them too. As such, the fractional hypertree width is simply the maximum fractional cover
over $V(-\alpha)$ and the $U_k$. This shows the first part of the equality.

The second part of the equality is the definition of faqw. ◄

The claimed runtime bound, as well as its analogue for product aggregations, follows as a simple
corollary. We now show an example where the runtime of InsideOut is much worse than the runtime of
our algorithm, primarily due to the fact that it is not output-sensitive.

► Example 58. Let $n$ be an even number, and consider an Ajar query $Q_{\mathcal{H},\alpha}$ where
$\mathcal{H} = (\{A_i \mid 1 \le i \le n\}, \{\{A_i, A_{i+1}\} \mid 1 \le i \le n - 1\} \cup \{\{A_n, A_1\}\})$,
and $\alpha$ is empty (i.e. the query is just a join). Also let each attribute take values
$1, 2, 3, \ldots, 2\sqrt{N/2}$. Suppose each relation $\{A_i, A_{i+1}\}$ for $1 \le i \le n - 1$
connects values of the same parity, while relation $\{A_n, A_1\}$ connects values of opposite
parities. Thus each relation has size $N$, and $\mathrm{IN} = O(N)$ ($n$ is a constant), and the
join output is empty. There is a GHD with bags
$\{A_1, A_2, A_3\}, \{A_1, A_3, A_4\}, \ldots, \{A_1, A_{n-1}, A_n\}$ that is compatible with the
empty ordering. The fhw of this GHD is 2, so we have $w^* = 2$. Thus the runtime of our algorithm
will be $O(\mathrm{IN}^2)$. InsideOut will compute an intermediate output consisting of the join of
$n - 1$ of the relations, which has size $\Theta(N^{n/2})$, so InsideOut's runtime will be at least
$\Omega(\mathrm{IN}^{n/2})$.
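The parity construction of Example 58 is easy to reproduce on a tiny instance. The sketch below is illustrative and uses brute-force enumeration rather than any of the join algorithms discussed here; the function names and the small parameters (n = 4 attributes, domain size d = 4) are assumptions chosen so that the full domain can be enumerated.

```python
from itertools import product


def cycle_relations(n, d):
    """n binary relations over values 1..d: the first n-1 connect values of
    the same parity, the last one (over A_n, A_1) connects opposite parities."""
    values = range(1, d + 1)
    same = {(a, b) for a in values for b in values if (a - b) % 2 == 0}
    opposite = {(a, b) for a in values for b in values if (a - b) % 2 == 1}
    return [same] * (n - 1) + [opposite]


def brute_force_join(relations, n, d, use=None):
    """Enumerate all assignments and keep those satisfying the chosen relations.
    Relation i constrains (A_{i+1}, A_{i+2}), wrapping around for the last one."""
    use = range(len(relations)) if use is None else use
    return [t for t in product(range(1, d + 1), repeat=n)
            if all((t[i], t[(i + 1) % n]) in relations[i] for i in use)]


n, d = 4, 4
rels = cycle_relations(n, d)
full_join = brute_force_join(rels, n, d)                 # all n relations
path_join = brute_force_join(rels, n, d, use=range(n - 1))  # only the same-parity path
```

On this instance each relation has 8 tuples, the join of the first three relations has 32 tuples, and the full cyclic join is empty, matching the gap between intermediate and final output sizes that the example describes.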

FAQ does discuss, at a very high level and without proofs, changes to InsideOut that will
allow their runtime to be output-sensitive (Section 10.2 of the FAQ paper). Their most general and
useful change involves building a GHD for the output variables and running a message passing
algorithm between the bags, which exactly describes GHDJoin. Implementing this change would make
InsideOut completely equivalent to AggroGHDJoin. We note that the FAQ paper frames these
changes as decisions in how to represent the output, whereas we present the optimization in an
algorithmic context, independent of any other storage optimizations.


Aggregations over Generalized Hypertree Decompositions 


B.2 Section 4

In Section 4 we define a partial order $<_{\mathcal{H},\alpha}$ that exactly characterizes the
constraints an aggregation ordering must satisfy to be equivalent to a given ordering $\alpha$.
Our partial ordering is complete, which is a result that FAQ cannot match. Much like our approach,
FAQ actually defines their own partial ordering, which we denote $<_{FAQ}$, and their work only
considers orderings that are linear extensions of $<_{FAQ}$. However, we will show an example
where $<_{FAQ}$ has unnecessary constraints:

► Example 59. Consider the Ajar query given by $\sum_A \max_B \sum_C R(A, B) \bowtie S(A, C)$. By
our characterization, $A <_{\mathcal{H},\alpha} B$ is the only constraint, giving rise to 3
different valid orderings. The FAQ characterization, however, has two constraints: $A <_{FAQ} B$
and $C <_{FAQ} B$, which only allows for 2 different valid orderings. Note that the FAQ
constraints preclude the original ordering $ABC$.

B.3 Section 5

In Section 5 we define a decomposition that relates the width of a valid GHD to the widths
of a series of ordinary GHDs. Variable orderings (as used by FAQ) are not as readily suited
as GHDs are for decompositions. FAQ does derive their own version of a decomposition, but
the difficulties that arise when using variable orderings are exemplified in the way FAQ switches
between GHDs and variable orderings in their proofs. In addition, the FAQ decomposition is
demonstrably weaker than ours; their decomposition incurs some overhead costs when combining
the sub-orderings to build the overall ordering, precluding a result like our corollary that
provides the groundwork for the variety of extensions we provide. To exemplify the gap between the
two decompositions, we inspect a specific Ajar query:

► Example 60. Consider the query
$\sum_B \sum_C \sum_D R(A, B) \bowtie S(B, C) \bowtie T(C, D) \bowtie U(D, A)$. Suppose
$|A| = \sqrt{N}$, $|B| = 2 = |D|$, $|C| = N$, and all of the pairwise relations are constructed as
complete cross products of the attributes' values. Our decomposition will result in the chain
GHD $A - ABD - BCD$, while the FAQ decomposition will result in the GHD $A - ABC - ACD$.
The runtimes of both FAQ and GHDJoin using the former GHD are $O(N)$, whereas the runtimes
using the latter GHD are $O(N^{3/2})$. As such, the FAQ decomposition will perform asymptotically
worse than our decomposition.

More generally, consider a query $Q_n$ with relations $R_i(A_i, B)$, $S_i(B, C_i)$,
$T_i(C_i, D)$, $U_i(D, A_i)$ for $1 \le i \le n$. Like before, all $|A_i|$'s are $\sqrt{N}$, all
$|C_i|$'s are $N$, and $|B| = |D| = 2$. The aggregation ordering only has the $+$ operator, on
$B$, $D$ and all $C_i$'s. Our decomposition gives the chain
$A_1 \ldots A_n - A_1 \ldots A_n B D - C_1 \ldots C_n B D$. This results in a runtime of $O(N^n)$.
FAQ's decomposition gives $A_1 \ldots A_n B C_1 \ldots C_n - A_1 \ldots A_n D C_1 \ldots C_n$. This
decomposition, and its corresponding ordering, give a runtime of $O(N^{3n/2})$. Thus the difference
between runtime exponents caused by FAQ's decomposition and our decomposition can be arbitrarily
high.

B.4 Section 5.3

► Example 61. Suppose we have an Ajar query $Q_{\mathcal{H},\alpha}$ with

$\mathcal{H} = (\{A, B, C, D, E, F\}, \{\{A, B\}, \{B, C\}, \{B, D, E\}, \{D, F\}\})$

and $\alpha = ((D, \sum), (E, \sum), (F, \sum))$. We start with the width-1 valid GHD
$(\mathcal{T}, \chi)$ with $\mathcal{V}_{\mathcal{T}} = \{v_1, v_2, v_3, v_4\}$ and

$\mathcal{E}_{\mathcal{T}} = \{\{v_1, v_2\}, \{v_2, v_3\}, \{v_3, v_4\}\}$

such that $v_1$ is the root, and $\chi(v_1) = \{A, B\}$, $\chi(v_2) = \{B, C\}$,
$\chi(v_3) = \{B, D, E\}$, $\chi(v_4) = \{D, F\}$.

Applying Log-GTA gives us a shorter GHD $(\mathcal{T}', \chi')$ with
$\mathcal{V}_{\mathcal{T}'} = \{u, v_1, v_2, v_3, v_4\}$,

$\mathcal{E}_{\mathcal{T}'} = \{\{v_1, u\}, \{u, v_2\}, \{u, v_3\}, \{u, v_4\}\}$

with $v_1$ as the root, $\chi'(u) = \{B, D\}$ and $\chi'(v_i) = \chi(v_i)$ for all $i$. Now
$TOP(D) = u$, which is an ancestor of $TOP(C) = v_2$, despite $C$ being an output attribute and $D$
not being an output attribute. This means the GHD $(\mathcal{T}', \chi')$ is invalid, showing that
applying Log-GTA to a valid GHD may make it invalid.

As the above example shows, we cannot directly apply Log-GTA to a valid GHD to get a
shorter valid GHD.
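The validity check used in Example 61 can be sketched directly from the compatibility condition stated earlier: whenever TOP(a) is an ancestor of TOP(b), either a is an output attribute or a precedes b in the ordering. The code below is an illustrative sketch with hypothetical names; a rooted tree is given as a child-to-parent dictionary, bags as a node-to-attribute-set dictionary, and the ordering as a list of the aggregated attributes (operators are omitted because they do not affect this particular check).

```python
def ancestors(node, parent):
    """All proper ancestors of `node` in a rooted tree given as child -> parent."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out


def top_nodes(bags, parent):
    """TOP(a): the node containing attribute a that is closest to the root."""
    depth = {v: len(ancestors(v, parent)) for v in bags}
    top = {}
    for v, bag in bags.items():
        for a in bag:
            if a not in top or depth[v] < depth[top[a]]:
                top[a] = v
    return top


def is_compatible(bags, parent, alpha, output_attrs):
    top = top_nodes(bags, parent)
    pos = {a: i for i, a in enumerate(alpha)}
    for a, top_a in top.items():
        if a in output_attrs:
            continue  # output attributes may sit above anything
        for b, top_b in top.items():
            if top_a in ancestors(top_b, parent):
                # a is aggregated, so b must also be aggregated and come later
                if b in output_attrs or pos[a] > pos[b]:
                    return False
    return True


alpha = ["D", "E", "F"]
outputs = {"A", "B", "C"}
bags = {"v1": {"A", "B"}, "v2": {"B", "C"}, "v3": {"B", "D", "E"}, "v4": {"D", "F"}}

# Original chain GHD rooted at v1.
chain_ok = is_compatible(bags, {"v2": "v1", "v3": "v2", "v4": "v3"}, alpha, outputs)

# Shortened GHD: new node u with bag {B, D} below the root v1.
short_bags = dict(bags, u={"B", "D"})
short_ok = is_compatible(short_bags, {"u": "v1", "v2": "u", "v3": "u", "v4": "u"},
                         alpha, outputs)
```

On the data of Example 61, the chain GHD passes the check while the shortened GHD fails it, because TOP(D) = u sits above TOP(C) = v2.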


C Characterizing Equivalent Orderings: Proofs


We now formally present our partial order $<_{\mathcal{H},\alpha}$ that characterizes the
interaction of the two forms of commuting. As we said in Section 4, we have two relations, PREC
and DNC, that are mutually recursive. We initialize the constraints to a base case and iteratively
update them until we reach a fixed point. We now formalize this. We use the binary operator
$<^i_{\mathcal{H},\alpha}$ to denote the constraint PREC after $i$ iterations, and the operator
$\not\sim^i_{\mathcal{H},\alpha}$ to denote DNC after $i$ iterations, with one difference: both
operators behave slightly differently for output attributes. To readily incorporate output
attributes into the constraints, we define an augmented aggregation ordering below:

► Definition 62. For any aggregation ordering a, let F be the set of output variables. Then 
dehne a® = of, q®, ..., a® to be a sequence such that af = (Fi,NULL) for 1 < i < 1^1 and 
af = oti+\F\ for 1^1 -P 1 < i < n. 

Note that n is dehned to be the number of attributes in the query. Now we can formally 
dehne ^ and Both of these binary operators operator over attribute-operator pairs, 

but since each attribute occurs at most once in an ordering, we can equivalently think of them 
as operating over attributes. We use these two interchangeably e.g. A <n,a B denotes the same 
thing as (A,©) <n,a (H,©'). 

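► Definition 63. For a given query $Q_{\mathcal{H},\alpha}$ with $\mathcal{H} = (\mathcal{V}, \mathcal{E})$,
we define relations $\not\sim^i_{\mathcal{H},\alpha}$ and partial orders $<^i_{\mathcal{H},\alpha}$ over
attribute-operator pairs in $\alpha^O$. For any $A, B \in \mathcal{V}$, suppose
$(A, \odot), (B, \odot') \in \alpha^O$. Then, for $i = 0$,
$(A, \odot) \not\sim^0_{\mathcal{H},\alpha} (B, \odot')$ if and only if one of the following is true:

- $\odot \neq \odot'$ and $\exists E \in \mathcal{E} : A, B \in E$. (0.1)

- $\odot \neq \odot'$ and either $\odot = \text{NULL}$ or $\odot' = \text{NULL}$. (0.2)

For $i > 0$, $(A, \odot) \not\sim^i_{\mathcal{H},\alpha} (B, \odot')$ if and only if
$(A, \odot) \not\sim^j_{\mathcal{H},\alpha} (B, \odot')$ holds for no $j < i$ and one of the
following is true:

- $\odot \neq \odot'$ and $\exists E \in \mathcal{E}, (C, \odot'') \in \alpha^O : B, C \in E,\ (A, \odot) <^{i-1}_{\mathcal{H},\alpha} (C, \odot'')$. (i.1)

- $\exists (C, \odot'') \in \alpha^O$ and $j, k < i$ : $(A, \odot) <^j_{\mathcal{H},\alpha} (C, \odot'') <^k_{\mathcal{H},\alpha} (B, \odot')$. (i.2)

For any $i \ge 0$, $(A, \odot) <^i_{\mathcal{H},\alpha} (B, \odot')$ if and only if
$(A, \odot) \not\sim^i_{\mathcal{H},\alpha} (B, \odot')$ and $(A, \odot)$ precedes $(B, \odot')$
in $\alpha^O$.

Finally, $(A, \odot) \not\sim_{\mathcal{H},\alpha} (B, \odot')$ if and only if
$(A, \odot) \not\sim^i_{\mathcal{H},\alpha} (B, \odot')$ for some $i \ge 0$. Similarly,
$(A, \odot) <_{\mathcal{H},\alpha} (B, \odot')$ if and only if
$(A, \odot) <^i_{\mathcal{H},\alpha} (B, \odot')$ for some $i \ge 0$.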

The core of our definition is the four labeled conditions for $\not\sim$. Condition 0.1 represents
the simplest structure that violates both conditions of the theorem on commuting aggregations; it
represents our base case. Condition 0.2 simply ensures the output attributes precede non-output
attributes. Our condition i.1 extends the structure from 0.1 beyond single relations. If
$A <_{\mathcal{H},\alpha} C$ and $C$ appears in a relation with $B$, we can guarantee that $A$ and $B$
cannot be separated in the way the second condition of that theorem requires, and if
$\odot \neq \odot'$, the first condition is violated as well. Condition i.2 simply
ensures that transitivity interacts properly with condition i.1.
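
As an illustration, the fixed-point construction can be sketched in a few lines of Python. This is only a sketch under stated assumptions: hyperedges are given as Python sets, operators as strings, NULL as `None`, and the function name `precedence_order` is hypothetical. The sketch collapses the iteration index $i$ and computes the least fixed point directly, which should yield the same final relation $<_{\mathcal{H},\alpha}$ as the indexed definition.

```python
from itertools import combinations

def precedence_order(edges, alpha, outputs):
    """Return the set of ordered pairs (A, B) with A < B at the fixed point.

    edges:   list of sets of attribute names (the hyperedges)
    alpha:   list of (attribute, operator) pairs, outermost first
    outputs: list of output attributes (given the NULL operator, here None)
    """
    # Augmented ordering: output attributes first, tagged with NULL.
    aug = [(f, None) for f in outputs] + list(alpha)
    pos = {a: i for i, (a, _) in enumerate(aug)}
    op = dict(aug)
    attrs = list(pos)

    def share_edge(x, y):
        return any(x in e and y in e for e in edges)

    dnc = set()   # unordered "do not commute" pairs
    prec = set()  # ordered pairs (A, B): A must precede B

    def relate(a, b):
        pair = frozenset((a, b))
        if a == b or pair in dnc:
            return False
        dnc.add(pair)
        # The DNC pair is oriented by position in the augmented ordering.
        prec.add((a, b) if pos[a] < pos[b] else (b, a))
        return True

    # Base cases (0.1) and (0.2).
    for a, b in combinations(attrs, 2):
        if op[a] != op[b] and (share_edge(a, b) or op[a] is None or op[b] is None):
            relate(a, b)

    changed = True
    while changed:
        changed = False
        for a, c in list(prec):
            for b in attrs:
                # (i.1): A < C, C shares an edge with B, operators differ.
                if b != c and op[a] != op[b] and share_edge(c, b):
                    changed |= relate(a, b)
                # (i.2): transitivity, A < C < B.
                if (c, b) in prec:
                    changed |= relate(a, b)
    return prec
```

For a path-shaped query with edges $\{A, C\}$ and $\{C, B\}$, the ordering $(A,\text{sum}), (B,\max), (C,\text{sum})$ yields the total order $A < B < C$, whereas making $C$ an output attribute leaves $A$ and $B$ incomparable.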
We now prove the two lemmas stated earlier, followed by proving soundness and completeness
of $<_{\mathcal{H},\alpha}$.


► Lemma 64 (Copy of Lemma 23). Suppose we are given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$
and an aggregation ordering $\alpha$. Fix two arbitrary attributes $A, B \in \mathcal{V}$ such that
$(A, \odot), (B, \odot') \in \alpha^O$ for differing operators $\odot \neq \odot'$. Then, for any path
$P$ in $\mathcal{H}$ between $A$ and $B$, there must exist some attribute in the path $C \in P$ such
that $C <_{\mathcal{H},\alpha} A$ or $C <_{\mathcal{H},\alpha} B$.


Proof. We use induction on the length of path $P$.

Base Case: Let $|P| = 2$. This implies that there exists some edge $E \in \mathcal{E}$ such that
$A, B \in E$. Thus $A \not\sim^0_{\mathcal{H},\alpha} B$. Then, by definition, either
$A <_{\mathcal{H},\alpha} B$ or $B <_{\mathcal{H},\alpha} A$ depending on which attribute appears
first in $\alpha^O$.

Induction: Suppose $|P| = N > 2$ and assume the lemma is true for paths of length $< N$. We
call this assumption the outer inductive hypothesis, for reasons that will become apparent later.
Path $P$ can be rewritten as $P = AP'B$ where $P'$ is a path of length at least 1. Let $C$ be the
node in $P'$ that appears earliest in $\alpha^O$; this implies that there exists no attribute in our
path $D \in P'$ such that $D <_{\mathcal{H},\alpha} C$. Define an operator $\odot''$ such that
$(C, \odot'') \in \alpha^O$. Since $\odot \neq \odot'$, either $\odot \neq \odot''$ or
$\odot' \neq \odot''$. Without loss of generality, assume that $\odot \neq \odot''$.

Consider the subpath of $P$ from $A$ to $C$. It is shorter than $N$ and connects two attributes with
different operators. We apply our inductive hypothesis to get that there exists some $D \in P$ such
that either $D <_{\mathcal{H},\alpha} A$ or $D <_{\mathcal{H},\alpha} C$. In the first case, we have
found an attribute that satisfies our conditions and we are done. In the second case, we know that
$D \notin P'$ by our definition of $C$. Thus $D$ must be $A$; we have that $A <_{\mathcal{H},\alpha} C$.

Consider the subpath of $P$ from $C$ to $B$; let $X_i$ denote the $i$th node in this path for
$0 \le i \le k$, where $X_0 = C$ and $X_k = B$. We claim that for all $i < k$,
$A <_{\mathcal{H},\alpha} X_i$. We argue this inductively; for our base case, we are given that
$A <_{\mathcal{H},\alpha} C = X_0$. Now let $i \ge 1$, and assume $A <_{\mathcal{H},\alpha} X_j$ for
$j < i$. Call this the inner inductive hypothesis.

Note that we have $A <_{\mathcal{H},\alpha} C$ and that $C$ must precede $X_i$ by definition. Thus $A$
precedes $X_i$ in $\alpha^O$. All that remains is showing that $A \not\sim_{\mathcal{H},\alpha} X_i$.
Define $\odot^*$ such that $(X_i, \odot^*) \in \alpha^O$. Since we assumed earlier that
$\odot \neq \odot''$, we know that either $\odot^* \neq \odot$ or $\odot^* \neq \odot''$.

- $\odot^* \neq \odot$

By our (inner) inductive hypothesis, we know that $A <_{\mathcal{H},\alpha} X_{i-1}$. We also know
that there must exist some edge $E \in \mathcal{E}$ such that $X_{i-1}, X_i \in E$. Thus by
condition i.1, $A \not\sim_{\mathcal{H},\alpha} X_i$.

- $\odot^* \neq \odot''$

By our (outer) inductive hypothesis, we know that for some $0 \le j < i$,
$X_j <_{\mathcal{H},\alpha} C$ or $X_j <_{\mathcal{H},\alpha} X_i$. By our definition of $C$, the
first case is impossible. And by our (inner) inductive hypothesis, we have that
$A <_{\mathcal{H},\alpha} X_j$. We thus have that
$A <_{\mathcal{H},\alpha} X_j <_{\mathcal{H},\alpha} X_i$, which implies that
$A \not\sim_{\mathcal{H},\alpha} X_i$ by condition i.2.

This gives us that $A <_{\mathcal{H},\alpha} X_{k-1}$. Since there exists an edge $E \in \mathcal{E}$
such that $X_{k-1}, B \in E$, condition i.1 tells us that $A \not\sim_{\mathcal{H},\alpha} B$. As
before, this implies that either $A <_{\mathcal{H},\alpha} B$ or $B <_{\mathcal{H},\alpha} A$. ◄


► Lemma 65 (Copy of Lemma 24). Given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and an
aggregation ordering $\alpha$, suppose we have two attributes $A, B \in V(\alpha)$ such that
$A <_{\mathcal{H},\alpha} B$. Then there must exist a path $P$ from $A$ to $B$ such that for every
$C \in P$, $C \neq A$, we have $A <_{\mathcal{H},\alpha} C$.


Proof. Define $\odot$ and $\odot'$ such that $(A, \odot), (B, \odot') \in \alpha$. In addition define
$i$ such that $A <^i_{\mathcal{H},\alpha} B$, which implies that $A$ precedes $B$ in $\alpha$ and that
$A \not\sim^i_{\mathcal{H},\alpha} B$. Our proof is by induction on $i$. For our base case, if
$A \not\sim^0_{\mathcal{H},\alpha} B$, we know that $\exists E \in \mathcal{E} : A, B \in E$. Thus
the path $P = AB$ satisfies our conditions.

For $i > 0$, we have the following cases:

- $\odot \neq \odot'$ and $\exists E \in \mathcal{E}, (C, \odot'') \in \alpha^O : B, C \in E,\ A <^{i-1}_{\mathcal{H},\alpha} C$

By our inductive hypothesis, there must exist a path $P'$ from $A$ to $C$ such that for all
$D \in P'$, $D \neq A$, we have $A <_{\mathcal{H},\alpha} D$. Then the path $P = P'B$ satisfies our
conditions.

- $\odot \neq \odot'$ and $\exists E \in \mathcal{E}, (C, \odot'') \in \alpha^O : A, C \in E,\ B <^{i-1}_{\mathcal{H},\alpha} C$

By our inductive hypothesis, there must exist a path $P'$ from $B$ to $C$ such that for all
$D \in P'$, $D \neq B$, we have $B <_{\mathcal{H},\alpha} D$, which also implies that
$A <_{\mathcal{H},\alpha} D$ by i.2. Let $\overline{P'}$ be the reverse of $P'$. Then the path
$P = A\overline{P'}$ satisfies our condition.

- $\exists C \in \mathcal{V}$ and $j, k < i$ : $A <^j_{\mathcal{H},\alpha} C <^k_{\mathcal{H},\alpha} B$

By our inductive hypothesis, there must exist two paths $P'$ and $P''$. $P'$ is a path from $A$ to
$C$ such that for all $D \in P'$, $D \neq A$, we have $A <_{\mathcal{H},\alpha} D$. Similarly, $P''$
is a path from $C$ to $B$ such that for all $D \in P''$, $D \neq C$, we have
$C <_{\mathcal{H},\alpha} D$, which implies $A <_{\mathcal{H},\alpha} D$. Thus the path $P = P'P''$
satisfies our conditions.

◄


► Theorem 66 (Copy of Theorem 14). Suppose we are given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$
and aggregation orderings $\alpha, \beta$. Then $\alpha \equiv_{\mathcal{H}} \beta$ if and only if
$\beta$ is a linear extension of $<_{\mathcal{H},\alpha}$.
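
To make the statement concrete: once the relation $<_{\mathcal{H},\alpha}$ is available, the theorem reduces testing equivalence to a position check. The following Python sketch assumes the relation is supplied as a set of ordered pairs and orderings as lists of attribute names; the function name `is_linear_extension` and this representation are illustrative assumptions, not notation from the paper.

```python
def is_linear_extension(beta, prec):
    """Check whether the ordering `beta` respects every constraint in `prec`.

    beta: list of attribute names, outermost aggregation first
    prec: set of pairs (A, B) meaning A must precede B
    Pairs mentioning attributes absent from beta (e.g. output attributes)
    are ignored, since they impose no constraint inside beta.
    """
    position = {attr: i for i, attr in enumerate(beta)}
    return all(position[a] < position[b]
               for a, b in prec
               if a in position and b in position)
```

For example, with the single constraint $C < B$ over attributes $A, B, C$, exactly three of the six permutations pass the check.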


Proof. Soundness:

We use induction on the number of inversions in $\beta$ with respect to the ordering $\alpha$.
Base Case: 0 inversions. Then $\beta$ is identical to $\alpha$ and $\alpha \equiv_{\mathcal{H}} \beta$.

Induction: Suppose $\beta$ has $N > 0$ inversions, and assume the claim is true for orderings with
$< N$ inversions. There must be some $\beta_i$ and $\beta_{i+1}$ that are inverted with respect to
$\alpha$. Consider the ordering $\beta'$ derived by swapping $\beta_i$ and $\beta_{i+1}$. It has
$N - 1$ inversions with respect to $\alpha$ and is clearly a linear extension of
$<_{\mathcal{H},\alpha}$. Thus, by the inductive hypothesis, $\alpha \equiv_{\mathcal{H}} \beta'$.

We now show that $\beta \equiv_{\mathcal{H}} \beta'$. Suppose $\beta_i = (A, \odot)$ and
$\beta_{i+1} = (B, \odot')$. We have two cases to consider.

- $\odot = \odot'$

By the theorem on commuting aggregations, we can swap $\beta_i$ and $\beta_{i+1}$ without affecting
the output. This implies that $\beta \equiv_{\mathcal{H}} \beta'$.

- $\odot \neq \odot'$

By Lemma 23 and since we know $A$ and $B$ are incomparable under $<_{\mathcal{H},\alpha}$, any path
between $A$ and $B$ must go through some attribute $C$ such that $C <_{\mathcal{H},\alpha} A$ or
$C <_{\mathcal{H},\alpha} B$. Since $\beta$ is a valid linear extension of $<_{\mathcal{H},\alpha}$,
these attributes $C$ appear earlier than index $i$ in $\beta$. This implies that $A$ and $B$ are in
separate connected components once the attributes appearing before index $i$ in $\beta$ are removed,
which implies that we can swap $\beta_i$ and $\beta_{i+1}$ without affecting the output by the same
theorem. This implies that $\beta \equiv_{\mathcal{H}} \beta'$.


Completeness:

We prove the contrapositive: we assume that we are given aggregation orderings $\alpha, \beta$ such
that $\beta$ is not a linear extension of $<_{\mathcal{H},\alpha}$ and we will show that
$\alpha \not\equiv_{\mathcal{H}} \beta$. We will do so by constructing an instance $I$ such that
$Q_{\mathcal{H},\alpha}(I) \neq Q_{\mathcal{H},\beta}(I)$.

We assume without loss of generality that $\mathcal{V} = V(\alpha)$, i.e. that there are no output
attributes. We will provide an example where $\beta$ and $\alpha$ must differ in the single
annotation that comprises the output. If there are output attributes, we can augment our example by
putting 1s in all the output attributes; our output will be composed of a single tuple composed of
all 1s with the same annotation as in our example below.

Consider the set of all valid linear extensions of $<_{\mathcal{H},\alpha}$. Suppose the maximum
length prefix identical to the prefix of $\beta$ is of size $k$. Among all linear extensions with
maximum length identical prefixes, suppose the minimum possible index for $\beta_{k+1}$ is $k'$.
Consider a linear extension $\alpha'$ such that $\alpha'_i = \beta_i$ for $i \le k$ and
$\alpha'_{k'} = \beta_{k+1}$. By the soundness part of our proof,
$\alpha' \equiv_{\mathcal{H}} \alpha$; to show that $\alpha \not\equiv_{\mathcal{H}} \beta$ we can
simply show that $\alpha' \not\equiv_{\mathcal{H}} \beta$.
Suppose $\alpha'_{k'} = (A, \odot) = \beta_{k+1}$ and $\alpha'_{k'-1} = (B, \odot')$. We know that
$B <_{\mathcal{H},\alpha} A$ since $k'$ is the minimum possible index for $\beta_{k+1}$ in any linear
extension of $<_{\mathcal{H},\alpha}$. Also, since $B$ and $A$ are adjacent in $\alpha'$, we know that
there cannot exist any $C$ such that $B <_{\mathcal{H},\alpha} C <_{\mathcal{H},\alpha} A$. These two
facts combine to imply $\odot \neq \odot'$. Then, by Lemma 24 there exists a path $P$ from $A$ to $B$
such that every attribute in our path $C \in P$ other than $A$ and $B$ must appear after index $k'$
in $\alpha'$.

Since $\odot \neq \odot'$, there must exist annotation values $x, y$ such that
$x \odot y \neq x \odot' y$. Define a relation $R_{\mathcal{V}}$ with two tuples. The first tuple will
contain a 1 for each attribute and an annotation $x$. The second tuple will contain a 2 for each
attribute in $P$ (including $A$ and $B$) and a 1 for every other attribute. The second tuple will be
annotated with $y$. Note that among the attributes in $P$, $(A, \odot)$ is the outermost aggregation
in $\beta$ and $(B, \odot')$ is the outermost aggregation in $\alpha'$. This implies that

$\Sigma_{\alpha'_1} \Sigma_{\alpha'_2} \cdots \Sigma_{\alpha'_{|\alpha'|}} R_{\mathcal{V}} = x \odot' y$

$\Sigma_{\beta_1} \Sigma_{\beta_2} \cdots \Sigma_{\beta_{|\beta|}} R_{\mathcal{V}} = x \odot y$


Let $C$ be the attribute in $P$ right before $B$; by definition there must exist an edge
$E \in \mathcal{E}$ such that $B, C \in E$. Consider the following instance over the schema
$\mathcal{H}$:

I = {tteRv} U {'k\pRv\E E S,E ^ E}. 

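By definition, $\bowtie_{F \in \mathcal{E}} R_F = R_{\mathcal{V}}$. Since we know that
$x \odot y \neq x \odot' y$, we have that

$\Sigma_{\alpha'_1} \cdots \Sigma_{\alpha'_{|\alpha'|}} \bowtie_{F \in \mathcal{E}} R_F \neq \Sigma_{\beta_1} \cdots \Sigma_{\beta_{|\beta|}} \bowtie_{F \in \mathcal{E}} R_F$

We thus have that $Q_{\mathcal{H},\alpha'}(I) \neq Q_{\mathcal{H},\beta}(I)$, which implies that
$\alpha' \not\equiv_{\mathcal{H}} \beta$.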
D Decomposing Valid GHDs: Proofs


We start by stating and proving a useful lemma about the aggregation orderings seen in the 
sub-trees of a decomposable GHD. 


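► Lemma 67. Given a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ and an aggregation ordering
$\alpha$, suppose $C$ is a connected component of $\mathcal{H} \setminus V(-\alpha)$. Define
$\mathcal{H}_C = (\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C)$. For any $A \in C$,
$B \in \mathcal{V}$, if $A <_{\mathcal{H},\alpha} B$ then
$A <_{\mathcal{H}_C, \alpha_{C \setminus C^O}} B$. Similarly, if
$A <_{\mathcal{H}_C, \alpha_{C \setminus C^O}} B$, either $A \in C^O$ or $A <_{\mathcal{H},\alpha} B$.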


Proof. First we show that for any $A \in C$, if $A <_{\mathcal{H},\alpha} B$ then
$B \in C \setminus C^O$. If $B \in C^O$, then by definition, $A \not<_{\mathcal{H},\alpha} B$. If
$B \notin C$, then every path between $A$ and $B$ must go through attributes in $V(-\alpha)$. Thus,
by the contrapositive of Lemma 24, $A \not<_{\mathcal{H},\alpha} B$. This implies that
$A <_{\mathcal{H},\alpha} B$ only for $B \in C \setminus C^O$.

Note that for any $A \in C^O$, since $A$ is an output attribute in $\alpha_{C \setminus C^O}$,
$A <_{\mathcal{H}_C, \alpha_{C \setminus C^O}} B$ for all $B \in C \setminus C^O$. This proves our
lemma for $A \in C^O$.

For $A \in C \setminus C^O$, we prove the lemma by showing that for any $i \ge 0$,
$\{B \mid A <^i_{\mathcal{H},\alpha} B\} = \{B \mid A <^i_{\mathcal{H}_C, \alpha_{C \setminus C^O}} B\}$.
Note that our earlier result shows that both of these sets are subsets of $C \setminus C^O$, so we
know that for any $i$, any $B$ in either set appears in both aggregation orderings.

Proof by induction on $i$. We first consider the base case: $i = 0$. We note that since $A$ is not an
output attribute, condition (0.2) is irrelevant. Since $\mathcal{H}_C$ contains all edges involving
attributes in $C$ and $\alpha_{C \setminus C^O}$ preserves the ordering and the operators of elements
of $\alpha$, condition (0.1) applies to the same set of attributes in both Ajar queries
$Q_{\mathcal{H},\alpha}$ and $Q_{\mathcal{H}_C, \alpha_{C \setminus C^O}}$. Thus
$\{B \mid A <^0_{\mathcal{H},\alpha} B\} = \{B \mid A <^0_{\mathcal{H}_C, \alpha_{C \setminus C^O}} B\}$.

For $i > 0$, the inductive hypothesis supposes
$\{B \mid A <^j_{\mathcal{H},\alpha} B\} = \{B \mid A <^j_{\mathcal{H}_C, \alpha_{C \setminus C^O}} B\}$
for $j < i$. Again since $\mathcal{H}_C$ contains all edges involving attributes in $C$ and
$\alpha_{C \setminus C^O}$ preserves the ordering and the operators of elements of $\alpha$, the
inductive hypothesis trivially implies conditions (i.1) and (i.2) apply to the same set of
attributes. ◄

Footnote 8: Recall $\mathcal{E}_C = \{E \in \mathcal{E} \mid E \cap C \neq \emptyset\}$.


► Theorem 68 (Copy of Theorem [2^. Every decomposable GHD is valid. 


Proof. Suppose the Ajar query is $Q_{\mathcal{H},\alpha}$. We need to show for any $A, B$ such that
$TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(B)$, $B \not<_{\mathcal{H},\alpha} A$.
Proof by induction on $|\alpha|$. If $|\alpha| = 0$, all GHDs are valid and decomposable. For
$|\alpha| > 0$, we note $\mathcal{T}_0$ ensures the output attributes are above non-output
attributes.

If $A$ and $B$ are non-output attributes and $TOP_{\mathcal{T}}(A)$ is an ancestor of
$TOP_{\mathcal{T}}(B)$, then both are in some $\mathcal{T}_C$. By the inductive hypothesis,
$\mathcal{T}_C$ is valid with respect to
$Q_{(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C), \alpha_{C \setminus C^O}}$. By Lemma 67, this
implies $B \not<_{\mathcal{H},\alpha} A$.


► Theorem 69 (Copy of Theorem 29). For every valid GHD $(\mathcal{T}, \chi)$, there exists a
decomposable GHD $(\mathcal{T}', \chi')$ such that for all node-monotone functions $\gamma$, the
$\gamma$-width of $(\mathcal{T}', \chi')$ is no larger than the $\gamma$-width of
$(\mathcal{T}, \chi)$.


In the proof sketch provided earlier, we claim to have width-preserving transformations
of a GHD that can enforce two additional properties, which we now present and name:

- TOP-unique: every node $t \in \mathcal{T}$ is $TOP_{\mathcal{T}}(A)$ for exactly one attribute $A$

- subtree-connected: for any node $t \in \mathcal{T}$ and the subtree $\mathcal{T}_t$ rooted at $t$,
the attributes $\{v \in \mathcal{V} \mid TOP_{\mathcal{T}}(v) \in \mathcal{T}_t\}$ form a connected
subgraph of $\mathcal{H}$

We first have two lemmas proving the transformations required to enforce these properties are
width-preserving.

► Lemma 70. Given a valid GHD $(\mathcal{T}, \chi)$ with $\gamma$-width $w$, we can transform it to be
TOP-unique while ensuring $\gamma$-width $\le w$.

Proof. Define a function $TOP^{-1}_{\mathcal{T}} : \mathcal{T} \to 2^{\mathcal{V}}$ from nodes to sets
of attributes such that $TOP^{-1}_{\mathcal{T}}(t) = \{A \mid TOP_{\mathcal{T}}(A) = t\}$.

First we eliminate nodes $t \in \mathcal{T}$ such that $|TOP^{-1}_{\mathcal{T}}(t)| = 0$. We note, by
definition, $\chi(t) \subseteq \chi(parent(t))$. This implies that we can simply remove $t$,
connecting all of its children to $parent(t)$ without violating any properties of the valid GHD.
And the width is trivially preserved.

Now suppose for some node $t \in \mathcal{T}$, $|TOP^{-1}_{\mathcal{T}}(t)| = k > 1$. Let $A_1$ be the
attribute in $TOP^{-1}_{\mathcal{T}}(t)$ that is earliest in the aggregation ordering. Let
$X = \chi(t) \cap \chi(parent(t))$. Then create a new node $t'$ such that
$\chi(t') = \{A_1\} \cup X$ and add it to $\mathcal{T}$ between $t$ and $parent(t)$. All of the
properties of the valid GHD must still hold, and since the new node contains a subset of the
attributes in $t$, the width must be preserved. Note that after adding this node,
$|TOP^{-1}_{\mathcal{T}}(t)| = k - 1$; we can repeat this process until the set is of size 1. ◄

► Lemma 71. Given a valid GHD $(\mathcal{T}, \chi)$ with $\gamma$-width $w$ that is TOP-unique, we can
transform it to be subtree-connected while preserving TOP-unique and $\gamma$-width $\le w$.

Proof. For any node $t \in \mathcal{T}$, define
$V_t = \{v \in \mathcal{V} \mid TOP_{\mathcal{T}}(v) \in \mathcal{T}_t\}$.

We proceed with a proof by (bottom-up) induction on the tree $\mathcal{T}$. As our base case, we
consider the leaves of $\mathcal{T}$. Since a leaf $l$ does not have any children, $V_l$ must contain
exactly one attribute, which is trivially a connected subgraph of $\mathcal{H}$.

Now we consider the subtree $\mathcal{T}_t$ rooted at some internal node $t$. Let $A$ be the
attribute such that $TOP_{\mathcal{T}}(A) = t$. Let $c_1, c_2, \ldots, c_k$ be the children of $t$.
By the inductive hypothesis, the subtrees rooted at these children satisfy all of the desired
properties. We note that, by definition, $\chi(t) \setminus A \subseteq \chi(parent(t))$. For any
child $c_i$ such that $V_{c_i}$ and $A$ are not connected in $\mathcal{H}$, we can remove $A$ from
$\chi(\mathcal{T}_{c_i})$ and set $parent(c_i)$ to be $parent(t)$. By doing so for all such children
$c_i$, we ensure that $V_t$ is a connected subgraph of $\mathcal{H}$. Since $A$ is not connected to
$V_{c_i}$, this transformation does not violate the properties of GHDs. Since we are not creating
any new ancestral relationships between nodes, the transformation does not violate the properties
of valid GHDs. Finally, the $\gamma$-width $\le w$ and TOP-unique properties are preserved
trivially. ◄

We have thus established that we can transform any valid GHD to additionally satisfy TOP-unique
and subtree-connected while preserving width. We now show that any valid GHD satisfying the two
additional properties is decomposable. Combined with the two lemmas above, this will complete the
proof of Theorem 69. Before we dive into the proof, we prove two helpful lemmas.

► Lemma 72. Given an AJAR+ query $Q_{\mathcal{H},\alpha}$, if $A <_{\mathcal{H},\alpha} B$ for $A, B$
with identical operators, there must exist some $C$ with a different operator such that
$A <_{\mathcal{H},\alpha} C <_{\mathcal{H},\alpha} B$.

Proof. $A <^i_{\mathcal{H},\alpha} B$ for some fixed $i$. If $A, B$ have identical operators, the only
way $A \not\sim^i_{\mathcal{H},\alpha} B$ is via rule (i.2), which requires some $C$ and $j, k < i$
such that $A <^j_{\mathcal{H},\alpha} C <^k_{\mathcal{H},\alpha} B$. If this $C$ has the same
operator as $A$ and $B$, we can repeatedly apply this rule until we find some attribute between
$A$ and $B$ with a different operator (since both of the rules for $i = 0$ only apply to attributes
with differing operators). ◄

► Lemma 73. Given an AJAR+ query $Q_{\mathcal{H},\alpha}$ and a valid GHD $(\mathcal{T}, \chi)$,
suppose $A <_{\mathcal{H},\alpha} B$, $A$ is not an output attribute, and $TOP_{\mathcal{T}}(A)$ is a
top node only for $A$. Then, $TOP_{\mathcal{T}}(A)$ must be an ancestor of $TOP_{\mathcal{T}}(B)$ in
any valid GHD.

Proof. Lemma 24 implies that there exists a path from $A$ to $B$ such that for every $C$ in the path
with $C \neq A$, $A <_{\mathcal{H},\alpha} C$. Let $C_0, C_1, \ldots, C_k$ represent the path, where
$C_0 = A$ and $C_k = B$. We claim that $TOP_{\mathcal{T}}(A)$ is an ancestor of
$TOP_{\mathcal{T}}(C_i)$ for all $1 \le i \le k$. Proof by induction on $i$. Our base case is for
$i = 1$. By the definition of a path, $A$ and $C_1$ must appear together in some hyperedge, implying
that they appear together in some bag of $\mathcal{T}$. Both $TOP_{\mathcal{T}}(C_1)$ and
$TOP_{\mathcal{T}}(A)$ must either be equal to or an ancestor of this bag. Since
$TOP_{\mathcal{T}}(C_1)$ cannot be equal to or an ancestor of $TOP_{\mathcal{T}}(A)$,
$TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(C_1)$.

For $i > 1$, we note since $C_i$ and $C_{i-1}$ appear in an edge together, by the same logic as
above, $TOP_{\mathcal{T}}(C_i)$ and $TOP_{\mathcal{T}}(C_{i-1})$ must both be equal to or an ancestor
of some node $t \in \mathcal{T}$. By the inductive hypothesis, $TOP_{\mathcal{T}}(A)$ is an ancestor
of $TOP_{\mathcal{T}}(C_{i-1})$, implying that $TOP_{\mathcal{T}}(A)$ is an ancestor of $t$. Since
$TOP_{\mathcal{T}}(C_i)$ cannot equal or be an ancestor of $TOP_{\mathcal{T}}(A)$,
$TOP_{\mathcal{T}}(A)$ must be an ancestor of $TOP_{\mathcal{T}}(C_i)$. ◄

► Lemma 74. Any valid GHD $(\mathcal{T}, \chi)$ that is TOP-unique and subtree-connected must be
decomposable.

Proof. We actually prove a slightly stronger statement. Define the property TOP-semiunique as
follows: every non-root node $t \in \mathcal{T}$ is the $TOP_{\mathcal{T}}$ node for exactly one
attribute and the root node is either the $TOP_{\mathcal{T}}$ for exactly one attribute or more than
one output attribute (and zero non-output attributes). Note that the TOP-unique property directly
implies the TOP-semiunique property. We will show that if $(\mathcal{T}, \chi)$ is a valid,
TOP-semiunique, and subtree-connected GHD for the Ajar query $Q_{\mathcal{H},\alpha}$, it must be
decomposable.

Proof by induction on $|\alpha|$. If $|\alpha| = 0$, then every GHD is decomposable.

Suppose $|\alpha| > 0$. Consider the set of nodes that are $TOP_{\mathcal{T}}$ nodes for output
attributes, i.e. $\{t \in \mathcal{T} \mid \exists A \in V(-\alpha) : TOP_{\mathcal{T}}(A) = t\}$.
Since no non-output attribute can have a top node above an output attribute's top node, the
TOP-semiunique property guarantees that this set of nodes forms a rooted subtree $\mathcal{T}_0$ of
$\mathcal{T}$ such that $\chi(\mathcal{T}_0) = V(-\alpha)$.


Consider the subtrees in $\mathcal{T} \setminus \mathcal{T}_0$. Call them
$\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k$. For any $i$, let $V_i$ be the attributes that
have $TOP_{\mathcal{T}}$ nodes in $\mathcal{T}_i$, i.e.
$V_i = \{A \in \mathcal{V} \mid TOP_{\mathcal{T}}(A) \in \mathcal{T}_i\}$. None of these $V_i$ can
contain any output attributes, and subtree-connected guarantees that each of the $V_i$ are
connected. Thus, the $V_i$ must be the connected components of $\mathcal{H} \setminus V(-\alpha)$.
So for each connected component $C$ of $\mathcal{H} \setminus V(-\alpha)$, the corresponding subtree
$\mathcal{T}_C$ is the subtree $\mathcal{T}_i$ such that $V_i = C$. Since for any $A \in C$,
$TOP_{\mathcal{T}}(A) \in \mathcal{T}_C$, the attributes in $C$ only appear in $\mathcal{T}_C$. Note
that for every edge $E \in \mathcal{E}$, there exists a node $t \in \mathcal{T}$ such that
$E \subseteq \chi(t)$. This implies that for every edge $E \in \mathcal{E}_C$, there exists a node
$t \in \mathcal{T}_C$ such that $E \subseteq \chi(t)$. As such, we can conclude that each
$\mathcal{T}_C$ is a GHD for the hypergraph $(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C)$.

Define $V_C = \bigcup_{E \in \mathcal{E}_C} E$. To complete this proof, we now need to show that each
$\mathcal{T}_C$ is a decomposable GHD for the Ajar query
$Q_{(V_C, \mathcal{E}_C), \alpha_{C \setminus C^O}}$. By the inductive hypothesis, if $\mathcal{T}_C$
is valid, TOP-semiunique and subtree-connected, it must be decomposable. Note that since
$\mathcal{T}$ is TOP-semiunique and subtree-connected, $\mathcal{T}_C$ must also be TOP-semiunique
and subtree-connected. We have also established that $\mathcal{T}_C$ is a GHD for
$(V_C, \mathcal{E}_C)$. Thus to finish this proof, we only need to show that for any
$A, B \in V_C$ such that $TOP_{\mathcal{T}_C}(B)$ is an ancestor of $TOP_{\mathcal{T}_C}(A)$,
$A \not<_{(V_C, \mathcal{E}_C), \alpha_{C \setminus C^O}} B$.

For ease of notation, in the rest of this proof we will use $<_C$ to represent
$<_{(V_C, \mathcal{E}_C), \alpha_{C \setminus C^O}}$. We show the contrapositive: if $A <_C B$,
$TOP_{\mathcal{T}_C}(B)$ is not an ancestor of $TOP_{\mathcal{T}_C}(A)$. We consider a few cases. If
$A \in V_C \setminus C$, $A$ must be in $V(-\alpha)$, implying $TOP_{\mathcal{T}_C}(A)$ is the root
of $\mathcal{T}_C$. For any $A \in C$, note that $TOP_{\mathcal{T}_C}(A) = TOP_{\mathcal{T}}(A)$. By
Lemma 67, for any $A \in C$, if $A <_C B$ then either $A <_{\mathcal{H},\alpha} B$ or $A \in C^O$.
In the former case, the fact that $\mathcal{T}$ is valid ensures $TOP_{\mathcal{T}}(B)$ is not an
ancestor of $TOP_{\mathcal{T}}(A)$. For the latter case, assume for contradiction that there exist
$A, B$ such that $TOP_{\mathcal{T}}(B)$ is an ancestor of $TOP_{\mathcal{T}}(A)$, $A <_C B$, and
$A \in C^O$.

We first claim that, without loss of generality, we can suppose that $A$ and $B$ have different
operators. To do so, we show that if $A$ and $B$ have the same operator, there must exist a $B'$
with a different operator such that $TOP_{\mathcal{T}}(B')$ is an ancestor of
$TOP_{\mathcal{T}}(A)$ and $A <_C B'$. By the definition of $C^O$, there must exist some
$A' \in C^O$ such that $A' <_{\mathcal{H},\alpha} B$. Since $A', A \in C$, there must exist a path
exclusively in $C$ that connects the two. And since $A', A \in C^O$, no attribute along the path
precedes either $A$ or $A'$ in $<_{\mathcal{H},\alpha}$. The contrapositive of Lemma 23 implies that
$A'$ and $A$ must have the same operator, which implies that $A'$ and $B$ have the same operator.
Lemmas 72 and 73 imply that there exists some $B'$ with a different operator such that
$A' <_{\mathcal{H},\alpha} B' <_{\mathcal{H},\alpha} B$ and $TOP_{\mathcal{T}}(B')$ is an ancestor of
$TOP_{\mathcal{T}}(B)$. The former result implies $B' \in C \setminus C^O$ and $A <_C B'$. The latter
result implies $TOP_{\mathcal{T}}(B')$ is an ancestor of $TOP_{\mathcal{T}}(A)$.

We now suppose $A$ and $B$ have different operators without loss of generality. Since $A \in C^O$,
any $O \in \mathcal{V}$ such that $O <_{\mathcal{H},\alpha} A$ must be an output attribute, thereby
implying $O <_{\mathcal{H},\alpha} B$ as well. This fact, combined with Lemma 23, implies every path
between $A$ and $B$ must contain some $D$ such that $D <_{\mathcal{H},\alpha} B$. Since $\mathcal{T}$
is valid and TOP-semiunique, the $TOP_{\mathcal{T}}(D)$ for each of these $D$ cannot be in the
subtree rooted at $TOP_{\mathcal{T}}(B)$. This implies that $A$ and $B$ are disconnected in the
subtree rooted at $TOP_{\mathcal{T}}(B)$, contradicting the subtree-connected property. ◄


► Theorem 75 (Copy of Theorem 31). For an Ajar query $Q_{\mathcal{H},\alpha}$, suppose
$\mathcal{H}_0, \ldots, \mathcal{H}_k$ are the characteristic hypergraphs $H(\mathcal{H}, \alpha)$.
Then GHDs $G_0, G_1, \ldots, G_k$ of $\mathcal{H}_0, \ldots, \mathcal{H}_k$ can be connected to form
a decomposable GHD $G$ for $Q_{\mathcal{H},\alpha}$. Conversely, any decomposable GHD $G$ of
$Q_{\mathcal{H},\alpha}$ can be partitioned into GHDs $G_0, G_1, \ldots, G_k$ of the characteristic
hypergraphs $\mathcal{H}_0, \ldots, \mathcal{H}_k$. Moreover, in both of these cases,
$\gamma$-width$(G) = \max_i \gamma$-width$(G_i)$.


Proof. Proof by induction on $|\alpha|$. Our base case is $|\alpha| = 0$. In this case, the only
characteristic hypergraph is the input hypergraph, that is $H(\mathcal{H}, \alpha) = \mathcal{H}$.
The theorem is then trivially true.

Suppose $|\alpha| > 0$. Any decomposable GHD $G$ for $Q_{\mathcal{H},\alpha}$ must be decomposable
into subtrees $\mathcal{T}_0, \ldots, \mathcal{T}_l$ such that $\chi(\mathcal{T}_0) = V(-\alpha)$ and
$\mathcal{T}_i$ is a decomposable GHD for
$Q_{(\bigcup_{E \in \mathcal{E}_{C_i}} E, \mathcal{E}_{C_i}), \alpha_{C_i \setminus C_i^O}}$, where
$C_i$ is the $i$th connected component of $\mathcal{H} \setminus V(-\alpha)$. Define $V_{C_i}$ to be
$\bigcup_{E \in \mathcal{E}_{C_i}} E$. To preserve the running intersection property of a GHD, the
root of $\mathcal{T}_i$ and its parent (in $\mathcal{T}_0$) must contain the attributes
$V_{C_i} \cap V(-\alpha)$. This implies each of the $\mathcal{T}_i$ are decomposable GHDs of
$Q_{\mathcal{H}^+_{C_i}, \alpha_{C_i \setminus C_i^O}}$, where $\mathcal{H}^+_{C_i}$ is the hypergraph
defined in the definition of the characteristic hypergraphs. By the inductive hypothesis, the
$\mathcal{T}_i$ (for $i \ge 1$) can be broken down into GHDs $G_1, \ldots, G_k$ of the
characteristic hypergraphs $\mathcal{H}_1, \ldots, \mathcal{H}_k$. In addition, $\mathcal{T}_0$ must
also have nodes that contain the edges $E \in \mathcal{E}$ such that $E \subseteq V(-\alpha)$,
implying it is the GHD $G_0$ of the characteristic hypergraph $\mathcal{H}_0$.

In the other direction, by the inductive hypothesis, the GHDs $G_1, \ldots, G_k$ can be stitched
together to form $\mathcal{T}_1, \ldots, \mathcal{T}_l$ such that each $\mathcal{T}_i$ is the
decomposable GHD for the Ajar query $Q_{\mathcal{H}^+_{C_i}, \alpha_{C_i \setminus C_i^O}}$. Note
that, by definition, for each $i$, $\mathcal{T}_i$ and $G_0$ must both have a node containing the
attributes $V_{C_i} \cap V(-\alpha)$; let $t_i$ and $g_i$ denote the appropriate node in
$\mathcal{T}_i$ and $G_0$, respectively. We can re-root $\mathcal{T}_i$ at $t_i$ without violating
any conditions since it amounts to re-rooting the topmost GHD of its decomposition; re-rooting
$\mathcal{T}_i$ at $t_i$ can only change the ancestor relationship between $TOP$ nodes of output
attributes. Once we re-root the $\mathcal{T}_i$ appropriately, we can simply set $parent(t_i)$ to be
$g_i$ to generate a decomposable GHD for the Ajar query $Q_{\mathcal{H},\alpha}$.


E Product Aggregations (Detailed version)

The primary application of queries with multiple aggregations is to establish bounds for the
Quantified Conjunctive Query (QCQ) problem. A QCQ query consists of an arbitrary conjunctive
query preceded by a series of (existential and/or universal) quantifiers, and a solution
must report the satisfying assignments to the non-quantified variables. A #QCQ query is similar
to a QCQ query, but instead of reporting satisfying assignments, we report the number of
satisfying assignments.

We now introduce a new type of aggregation, called product aggregation, that lets us efficiently
handle QCQ queries. We define the Ajar problem for product aggregations, and then
extend our earlier algorithm to handle this new type of Ajar query.

E.l Ajar queries with product aggregates 

In order to recover QGQ as an Ajar query, we need product aggregations i.e. aggregations that 
use the 0 operator. Throughout the paper, we have assumed that an absent tuple effectively 
has an annotation of 0. To maintain this for product aggregations, we need to dehne product 
aggregation so that it returns 0 if any tuple is absent. In particular, we redehne ®) 
include a projected tuple tp\A m the output only if {tp\A ° ^a) exists in Rp for every possible 
value tA e More formally, let B = E\A: 

► Definition 76. RAB = {{tB.X)-'^tAeV^,tBOtAeRAB&rid\= A*} 

(A,®) Bt=tB 

Note that this adjusted dehnition implies an annotation of 0 is once again fully equivalent to 
absence. We can adjust the dehnition of aggregation orderings (and Ajar queries) to possibly 
include this new type of aggregation. We can construct valid GHDs for such aggregations as 
before, and run AggroGHDJoin to solve them. 

► Example 77. Consider the semiring $(\{0,1\}, \max, \cdot)$. Note that in this domain $\max$ is
equivalent to a disjunction (and the logical existential quantifier) and $\prod$ is equivalent to a
conjunction (and the logical universal quantifier). Thus the space of Ajar queries that use these
two aggregators recovers all QCQ queries.
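
The correspondence in this example can be checked mechanically. The following Python sketch is an illustration only: the relation is assumed to be a 0/1 matrix `R[x][y]`, and the helper names `forall` and `exists_forall` are hypothetical. It evaluates $\exists x\, \forall y\, R(x, y)$ as a max-aggregation over $x$ of a product aggregation over $y$.

```python
def forall(values):
    """Product aggregation over {0, 1}: equals 1 only if every value is 1."""
    result = 1
    for v in values:
        result *= v
    return result

def exists_forall(R):
    """Evaluate 'exists x, forall y: R(x, y)' with max and product.

    R is a list of rows; R[x][y] is the 0/1 annotation of the tuple (x, y).
    """
    return max(forall(row) for row in R)
```

Comparing `exists_forall(R)` against `any(all(row) for row in R)` over every 2-by-2 matrix with 0/1 entries gives the same answer in each case.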

An aggregation using $\otimes$ is called a product aggregation, and an attribute that is aggregated
using a product aggregation is called a product attribute. Aggregations that are not product
aggregations are called semiring aggregations, while attributes that are neither output attributes
nor product attributes are called semiring attributes.

Idempotence Assumption: Using the product aggregation as defined raises one issue. Our
semiring aggregates satisfy the distributive property, which is integral in our ability to push
down aggregations and for our results about commuting aggregations. In general, product
aggregations do not distribute:
$(a \otimes b) \otimes (a \otimes c) = (a \otimes a) \otimes (b \otimes c) \neq a \otimes (b \otimes c)$.
However if we require our product aggregations to be idempotent, that is $a \otimes a = a$ for any
element $a$, our product aggregations will distribute. And for QCQ, the domain is restricted to
$\{0,1\}$, in which product aggregations are idempotent. So in this section, we will study
idempotent product aggregations; we will generalize to non-idempotent aggregations in Appendix E.4.
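
The role of idempotence can be seen with a brute-force check. The Python sketch below is illustrative only (the helper name `distributes` is hypothetical): it tests the identity $(a \otimes b) \otimes (a \otimes c) = a \otimes (b \otimes c)$ over a finite domain for a given binary operator.

```python
def distributes(domain, times):
    """Check (a*b)*(a*c) == a*(b*c) for all a, b, c in `domain`."""
    return all(times(times(a, b), times(a, c)) == times(a, times(b, c))
               for a in domain for b in domain for c in domain)

def multiply(a, b):
    return a * b
```

Multiplication passes the check on $\{0, 1\}$, where it is idempotent, and fails on $\{0, 1, 2\}$ (take $a = 2$, $b = c = 1$). The idempotent operator `min` passes on any finite domain of numbers.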

E.2 Solving Ajar queries with product aggregates

For aggregation orderings that have product aggregations, the rules for determining when two
orderings are equivalent are somewhat different. We now discuss how we can optimize this
new type of aggregation further; product aggregations are fundamentally different from ordinary
aggregation because we can do the aggregation before the join, as seen in the following example:

► Example 78. In the semiring $(\{0,1\}, \max, \cdot)$, suppose we have two relations
$R(A,B) = \{((0,0), x), ((0,1), y)\}$ and $S(B,C) = \{((0,1), p), ((1,1), q)\}$. Consider the Ajar
query $\sum_{(B,\cdot)} R(A,B) \bowtie S(B,C)$. If we compute the join, we will get two tuples with
the annotations $x \cdot p$ and $y \cdot q$, and then aggregating over $B$ will produce a relation
with the element $((0,1), x \cdot p \cdot y \cdot q)$. However, note that
$x \cdot p \cdot y \cdot q = (x \cdot y) \cdot (p \cdot q)$, implying that
$\sum_{(B,\cdot)} R(A,B) \bowtie S(B,C) = (\sum_{(B,\cdot)} R(A,B)) \bowtie (\sum_{(B,\cdot)} S(B,C))$.
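
The example can be replayed in code. The following Python sketch makes several assumptions for illustration: annotated relations are dictionaries from tuples to annotations with an explicit schema list, the domain of $B$ is $\{0, 1\}$, and the helper names `join`, `product_aggregate`, and `pushdown_matches` are hypothetical. Following the definition of product aggregation, a group is kept only if it has a tuple for every value in the domain of the aggregated attribute.

```python
def join(r, r_schema, s, s_schema):
    """Natural join of two annotated relations; annotations are multiplied."""
    out_schema = r_schema + [a for a in s_schema if a not in r_schema]
    out = {}
    for tr, ann_r in r.items():
        for ts, ann_s in s.items():
            row = dict(zip(r_schema, tr))
            other = dict(zip(s_schema, ts))
            if all(row[a] == other[a] for a in row.keys() & other.keys()):
                row.update(other)
                out[tuple(row[a] for a in out_schema)] = ann_r * ann_s
    return out, out_schema

def product_aggregate(rel, schema, attr, domain):
    """Product aggregation over `attr`, requiring every domain value to be present."""
    idx = schema.index(attr)
    groups = {}
    for t, ann in rel.items():
        key = tuple(v for i, v in enumerate(t) if i != idx)
        groups.setdefault(key, {})[t[idx]] = ann
    out = {}
    for key, by_value in groups.items():
        if all(v in by_value for v in domain):
            ann = 1
            for v in domain:
                ann *= by_value[v]
            out[key] = ann
    return out, [a for a in schema if a != attr]

def pushdown_matches(x, y, p, q):
    """Compare aggregating after the join with aggregating before it."""
    R = {(0, 0): x, (0, 1): y}   # schema (A, B)
    S = {(0, 1): p, (1, 1): q}   # schema (B, C)
    late = product_aggregate(*join(R, ["A", "B"], S, ["B", "C"]), "B", [0, 1])
    early = join(*product_aggregate(R, ["A", "B"], "B", [0, 1]),
                 *product_aggregate(S, ["B", "C"], "B", [0, 1]))
    return late == early == ({(0, 1): x * y * p * q}, ["A", "C"])
```

Calling `pushdown_matches` on every choice of $x, y, p, q \in \{0, 1\}$ confirms that both evaluation orders produce the single tuple $(0, 1)$ with annotation $x \cdot y \cdot p \cdot q$.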

Now we describe our algorithm for solving Ajar queries when product aggregations are
present. Our algorithm follows the same lines as the algorithm from Section 3.3. Recall that
the algorithm consisted of searching for equivalent orderings, then searching for a GHD compatible
with an equivalent ordering, and running AggroGHDJoin on the GHD with the smallest fhw. For
product aggregations, we need to modify our algorithm for testing equivalent orderings, and our
definition of compatibility; we do these in turn.

Testing orderings for equivalence 

Algorithm 7 gives the pseudo-code for our equivalence test for orderings containing product
aggregates.

We have the lemma analogous to Lemma 

► Lemma 79 (Copy of Lemma [4^. Algorithm^ returns True if and only if a =-h P. 

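Proof. Soundness: Suppose Algorithm 7 returns true; we will show $\alpha \equiv_{\mathcal{H}} \beta$.
We induct on the length $|\alpha|$. For our base case, when $|\alpha| = 0$, we return true when
$|\beta| = 0$. In this case, the two (empty) orderings are trivially equivalent.

Suppose $|\alpha| > 0$. We have two cases: when $\mathcal{H} \setminus (V(-\alpha) \cup PA(\alpha))$
has one component and when it has multiple components. We first consider the multiple components
case. Let the components be $C_1, \ldots, C_m$. Then we define $C'_1, \ldots, C'_m$ as in the
algorithm, i.e. for $1 \le i \le m$, let $\mathcal{E}_i$ be
$\{E \in \mathcal{E} \mid E \cap C_i \neq \emptyset\}$, the elements of $\mathcal{E}$ that intersect
with $C_i$. Then $C'_i = C_i \cup ((\bigcup_{E \in \mathcal{E}_i} E) \cap PA(\alpha))$.

We define $\mathcal{E}_0$ to be $\mathcal{E} \setminus (\bigcup_{1 \le i \le m} \mathcal{E}_i)$ (these
are relations with only output attributes or product aggregations). Accordingly let $C'_0$ be the
product aggregations that appear in $\mathcal{E}_0$. We can then express the following identities:

$\bowtie_{F \in \mathcal{E}} R_F = \bowtie_{0 \le i \le m} \bowtie_{F \in \mathcal{E}_i} R_F$

$\sum_{\alpha} \bowtie_{F \in \mathcal{E}} R_F = \bowtie_{0 \le i \le m} \sum_{\alpha_{C'_i}} \bowtie_{F \in \mathcal{E}_i} R_F$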
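Algorithm 7 TestEquivalence($\mathcal{H} = (\mathcal{V}_{\mathcal{H}}, \mathcal{E}_{\mathcal{H}})$, $\alpha$, $\beta$)

Input: Query hypergraph $\mathcal{H}$, orderings $\alpha$, $\beta$.
Output: True if $\alpha \equiv_{\mathcal{H}} \beta$, False otherwise.
if $|\alpha| = |\beta| = 0$ then
    return True
end if
Remove $V(-\alpha)$ from $\mathcal{H}$, then divide $\mathcal{H}$ into connected components $C_1, \ldots, C_m$.
if $m > 1$ then
    return $\bigwedge_i$ TestEquivalence($\mathcal{H}$, $\alpha_{C_i}$, $\beta_{C_i}$)
end if
Choose $j$ such that $\beta_j = \alpha_1$. Let $\beta_j = (b_j, \odot_j)$.
if $\exists i < j$ : $\beta_i = (b_i, \odot_i)$, $\odot_i \neq \odot_j$ and there is a path from $b_i$ to $b_j$ in $\{b_i, b_{i+1}, \ldots, b_{|\alpha|}\}$ then
    return False
end if
Let $\beta'$ be $\beta$ with $\beta_j$ removed.
Let $\alpha'$ be $\alpha$ with $\alpha_1$ removed.
return TestEquivalence($\mathcal{H}$, $\alpha'$, $\beta'$)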


The RHS may have a product aggregation $(a, \otimes)$ happening in multiple components, but it
happens exactly once per relation containing $a$. We note this identity holds for $\beta$ as well.
This identity implies that $\alpha \equiv_{\mathcal{H}} \beta$ if
$\alpha_{C'_i} \equiv_{\mathcal{H}} \beta_{C'_i}$ for all $i$. We note that for $i = 0$, all of the
aggregations contain the same operator, so any ordering is equivalent. For $i > 0$, we note that we
return true only if all of the recursive calls return true, implying
$\alpha_{C'_i} \equiv_{\mathcal{H}} \beta_{C'_i}$ by the inductive hypothesis.

When $\mathcal{H} \setminus (V(-\alpha) \cup PA(\alpha))$ has one component, we choose $j$ such that
$\beta_j = \alpha_1$ and define $\beta'$ to be $\beta$ with $\beta_j$ removed. Note $\alpha'$ is
defined to be $\alpha$ with $\alpha_1 = \beta_j$ removed. To show
$\beta \equiv_{\mathcal{H}} \alpha$, we need to show $\alpha' \equiv_{\mathcal{H}} \beta'$ and
$\beta \equiv_{\mathcal{H}} \beta_j \beta'$. Since we return true only when our recursive call on
$\alpha'$ and $\beta'$ returns true, the former equivalence holds by the inductive hypothesis.

To show P =ji PjPP we ensure Pj and Pi can commute for all i < j. More specifically, we 
ensure that if Pj can be moved to index i + 1, it can be moved to index i. For any Pi with the 
same operator. Pi and Pj trivially commute. If Pi has a different operator, we know there is no 
path between their attributes bi and bj among the nodes 

{{bi, h,+i ,... , 6 |a|} \ PA(a)) U {bt,bj}. 

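Let $V$ be this set of attributes. Define $V_1 \subseteq V$ to be the set of nodes connected to $b_i$ in the hypergraph restricted to $V$ (we know $b_j \notin V_1$). Let $\mathcal{E}_1$ be the set of edges that contain some attribute in $V_1$, i.e. $\{E \in \mathcal{E} \mid E \cap V_1 \neq \emptyset\}$. We note that the attributes of $V \setminus V_1$ do not appear in the edges of $\mathcal{E}_1$. Let $\mathcal{E}_2 = \mathcal{E} \setminus \mathcal{E}_1$; the attributes of $V \setminus V_1$ all appear in $\mathcal{E}_2$. We can then express the following identities:

$$\bowtie_{F \in \mathcal{E}} R_F = \Big(\bowtie_{F \in \mathcal{E}_1} R_F\Big) \bowtie \Big(\bowtie_{F \in \mathcal{E}_2} R_F\Big)$$

$$\sum_{\beta_{V \cup PA(\alpha)}} \bowtie_{F \in \mathcal{E}} R_F = \Big(\sum_{\beta_{V_1 \cup PA(\alpha)}} \bowtie_{F \in \mathcal{E}_1} R_F\Big) \bowtie \Big(\sum_{\beta_{V_2 \cup PA(\alpha)}} \bowtie_{F \in \mathcal{E}_2} R_F\Big)$$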

We note, by definition, that $\beta_i$ and $\beta_j$ must be pushed down into different aggregations in the previous expression. This implies that we can commute $\beta_i$ and $\beta_j$ when they are adjacent, completing the soundness proof.

Completeness: We prove that if the algorithm returns false, then there must exist a database instance $I$ such that $Q_{\mathcal{H},\alpha}(I) \neq Q_{\mathcal{H},\beta}(I)$.

If the algorithm returns false, there must be a component $C$, with $\alpha' = \alpha_C$ and $\beta' = \beta_C$, such that $\beta'_j = \alpha'_1$, and there exists an $i < j$ such that $\beta'_i = (b_i, \oplus'_i)$, $\beta'_j = (b_j, \oplus'_j)$, $\oplus'_i \neq \oplus'_j$ and there is a path from $b_i$ to $b_j$ that consists of only $b_i$, $b_j$, and semiring attributes in $\{b_i, b_{i+1}, \dots, b_{|\alpha'|}\}$. We now define our instance $I$ that gives different outputs on these orderings.

If neither $\oplus'_i$ nor $\oplus'_j$ is a product operator, then choose $x, y$ such that $x \oplus'_i y \neq x \oplus'_j y$. If one of them is a product operator while the other is not, choose $x = y = 1$. Now we define the attribute domains. Let $B$ be the set of attributes in the path from $b_i$ to $b_j$ consisting of $b_i$, $b_j$ and semiring attributes in $\{b_i, b_{i+1}, \dots, b_{|\alpha'|}\}$. For every $b \in B$, we set $\mathcal{D}^b = \{0, 1\}$. For every $b' \notin B$, we set its domain to $\{0\}$. Every relation that has at least one attribute from $B$ has two tuples. One tuple has value 0 for all attributes in $B$, the other has value 1 for all attributes in $B$. The values of the other attributes are of course always 0. One of the relations containing an attribute from $B$ has annotation $x$ for the tuple with 0s and annotation $y$ for the tuple with 1s. All other annotations are 1.

Clearly, each aggregation for an attribute $b' \notin B$ is a no-op, since the domain size $|\mathcal{D}^{b'}| = 1$. Moreover, all aggregations other than $\beta_i$, $\beta_j$ in $\beta$ and $\alpha$ are also no-ops, because they are non-product aggregations (from the way we chose $B$) and there is a unique value of the attribute for each tuple it maps to after aggregation.

Thus if both $\beta_i$ and $\beta_j$ are non-product aggregations themselves, then we have $Q_{\mathcal{H},\alpha}(I) = x \oplus'_j y$ and $Q_{\mathcal{H},\beta}(I) = x \oplus'_i y$, which are unequal due to how we chose $x$ and $y$. If one of them, say $\beta_j$, is a product aggregation, then $Q_{\mathcal{H},\alpha}(I) = 1$ while $Q_{\mathcal{H},\beta}(I) = 0$ (and vice versa if $\beta_i$ is a product aggregation). This is because in $\beta$, when we do the product aggregation $\beta_j$, there is only one value of $b_j$ per corresponding output value, so the product annotation is 0 (and finally the $\beta_i$ aggregation adds two 0's to get 0). On the other hand, for $\alpha$, $\beta_j = \alpha_1$ happens when $b_j$ has two values 0, 1 corresponding to a single output tuple, so their annotations are multiplied to get $x \otimes y = 1$. This shows that the algorithm is complete. ◄
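
The witness instance from the completeness argument can be checked by brute force on a single relation over the two attributes $b_i$ and $b_j$. The following is a minimal sketch, not the paper's algorithm: the helper `evaluate`, the attribute names, and the choice of `+`, `max`, and integer multiplication as operators are all illustrative assumptions. Absent tuples are treated as annotated with 0, which is the identity for both `+` and `max` on non-negative integers.

```python
from functools import reduce
import operator


def evaluate(ordering, annotation, attrs, domain=(0, 1), assignment=None):
    """Evaluate nested aggregations over one annotated relation.

    ordering   -- list of (attribute, binary operator), outermost aggregation first
    annotation -- dict mapping a full tuple (in `attrs` order) to its annotation;
                  tuples that are absent are treated as annotated with 0
    """
    assignment = dict(assignment or {})
    if not ordering:
        key = tuple(assignment[a] for a in attrs)
        return annotation.get(key, 0)
    (attr, op), rest = ordering[0], ordering[1:]
    # Aggregate the inner result over every value in the attribute's domain.
    values = [
        evaluate(rest, annotation, attrs, domain, {**assignment, attr: v})
        for v in domain
    ]
    return reduce(op, values)


ATTRS = ("b_i", "b_j")

# Non-product case: x = 1 on the all-0 tuple, y = 2 on the all-1 tuple.
relation = {(0, 0): 1, (1, 1): 2}
alpha = [("b_j", operator.add), ("b_i", max)]   # b_j aggregated last (outermost)
beta = [("b_i", max), ("b_j", operator.add)]    # b_i aggregated last (outermost)

# Product case: x = y = 1, with a product aggregation on b_j.
ones = {(0, 0): 1, (1, 1): 1}
alpha_p = [("b_j", operator.mul), ("b_i", operator.add)]
beta_p = [("b_i", operator.add), ("b_j", operator.mul)]

if __name__ == "__main__":
    print(evaluate(alpha, relation, ATTRS), evaluate(beta, relation, ATTRS))
    print(evaluate(alpha_p, ones, ATTRS), evaluate(beta_p, ones, ATTRS))
```

In both cases the two orderings disagree on this instance, which is the behaviour the proof relies on.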

Compatible GHDs 

Product aggregations not only change the set of equivalent orderings, but also the set of GHDs compatible with a given ordering. In fact, product aggregations allow us to break the rules of GHDs without causing incorrect behavior. In particular, we can have a product attribute $P$ appear in completely disparate parts of the GHD. Thus before defining compatibility for GHDs, we define the notion of product partitions.

► Definition 80. Given a hypergraph $\mathcal{H} = (V, \mathcal{E})$ and aggregation ordering $\alpha$, let $S = \{a \in V \mid (a, \otimes) \in \alpha\}$ be the set of attributes with product aggregations. A product partition is a set $\{P_a \mid a \in S\}$ where $P_a$ is a partition of $\{F \in \mathcal{E} \mid a \in F\}$ (the relations that contain $a$).

We will duplicate each attribute $a$ for each partition of $P_a$ and have the partition specify which edges contain each instance of $a$.

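► Definition 81. Suppose we are given a hypergraph $\mathcal{H} = (V, \mathcal{E})$, aggregation ordering $\alpha$, and product partition $P$. The product partition hypergraph $\mathcal{H}_P$ is the pair $(V_P, \mathcal{E}_P)$ such that

- $S = \{a \in V \mid P_a \in P\}$
- $V_P = \big(\bigcup_{a \in S} \{a_1, a_2, \dots, a_{|P_a|}\}\big) \cup (V \setminus S)$
- $\rho : V \times \mathcal{E} \to V_P$ where $\rho(a, F) = a$ if $a \notin S$, and otherwise $\rho(a, F) = a_i$ where $F$ is in partition $i$ of $P_a$
- $\mathcal{E}_P = \bigcup_{F \in \mathcal{E}} \{\{\rho(a, F) \mid a \in F\}\}$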
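
As an illustration of Definition 81, the sketch below builds a product partition hypergraph from a small hypergraph. It is only a sketch under stated assumptions: edges are given as a dict from hypothetical edge names to attribute sets, the partition is a dict from each product attribute to a list of blocks of edge names, and copies of an attribute `a` are named `a_1`, `a_2`, and so on.

```python
def product_partition_hypergraph(edges, partition):
    """Return (V_P, E_P) for a hypergraph and a product partition.

    edges     -- dict: edge name -> set of attributes
    partition -- dict: product attribute -> list of blocks, each a list of edge names
    """
    def rho(attr, edge_name):
        # Non-product attributes are kept; product attributes are renamed
        # after the block of the partition that contains this edge.
        if attr not in partition:
            return attr
        for index, block in enumerate(partition[attr], start=1):
            if edge_name in block:
                return f"{attr}_{index}"
        raise ValueError(f"edge {edge_name!r} missing from partition of {attr!r}")

    new_edges = {
        name: frozenset(rho(a, name) for a in attrs)
        for name, attrs in edges.items()
    }
    nodes = {a for attrs in edges.values() for a in attrs if a not in partition}
    nodes |= {
        f"{a}_{i}"
        for a, blocks in partition.items()
        for i in range(1, len(blocks) + 1)
    }
    return nodes, new_edges


# Hypothetical example: product attribute "a" occurs in two relations,
# and the partition places them in separate blocks.
example_edges = {"F1": {"a", "b"}, "F2": {"a", "c"}}
example_partition = {"a": [["F1"], ["F2"]]}
example_nodes, example_new_edges = product_partition_hypergraph(
    example_edges, example_partition
)
```

In this example the two copies of `a` end up in different edges, so a GHD of the resulting hypergraph may place them in unrelated parts of the tree.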

► Definition 82. Given a hypergraph $\mathcal{H}$ and aggregation ordering $\alpha$, an aggregating generalized hypertree decomposition (AGHD) is a triple $(\mathcal{T}, \chi, P)$ such that $(\mathcal{T}, \chi)$ is a GHD of the product partition hypergraph $\mathcal{H}_P$.

For any attribute $a$ in the Ajar query, $TOP_P(a)$ for an AGHD $(\mathcal{T}, \chi, P)$ can be defined as the set $\{TOP_{\mathcal{T}}(a_1), TOP_{\mathcal{T}}(a_2), \dots, TOP_{\mathcal{T}}(a_{|P_a|})\}$. Now we can define the notion of compatibility of an AGHD with an ordering.

► Definition 83. An AGHD $(\mathcal{T}, \chi, P)$ for an Ajar query $Q_{\mathcal{H},\alpha}$ is compatible with an ordering $\beta \equiv_{\mathcal{H}} \alpha$ if for each attribute pair $a, b$ for which there exist $v_1 \in TOP_P(a)$, $v_2 \in TOP_P(b)$ such that $v_1$ is an ancestor of $v_2$, $a$ must occur before $b$ in the ordering $\beta$.
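
The compatibility condition of Definition 83 can be checked directly once the TOP sets are known. The sketch below is an assumption-laden illustration rather than the paper's procedure: the tree is given as a hypothetical `parent` map, `top` maps each attribute to its set of TOP nodes, and attributes that do not occur in the ordering are treated as output attributes, which may sit above anything.

```python
def is_ancestor(parent, u, v):
    """Return True if tree node u is a strict ancestor of tree node v."""
    node = parent.get(v)
    while node is not None:
        if node == u:
            return True
        node = parent.get(node)
    return False


def is_compatible(parent, top, ordering):
    """Check the ancestor condition of Definition 83.

    parent   -- dict: tree node -> parent node (None for the root)
    top      -- dict: attribute -> set of TOP nodes (one per copy of the attribute)
    ordering -- list of aggregated attributes, in aggregation order
    """
    position = {attr: i for i, attr in enumerate(ordering)}
    for a in top:
        if a not in position:
            continue  # assumption: a is an output attribute, always allowed on top
        for b in top:
            if a == b:
                continue
            above = any(
                is_ancestor(parent, v1, v2) for v1 in top[a] for v2 in top[b]
            )
            # An aggregated attribute above b must precede b in the ordering.
            if above and (b not in position or position[a] >= position[b]):
                return False
    return True


# Hypothetical two-node tree: root "r" with a single child "c".
example_parent = {"r": None, "c": "r"}
example_top = {"A": {"r"}, "B": {"c"}}
```

With this tree, `is_compatible(example_parent, example_top, ["A", "B"])` holds, while the reversed ordering fails the check.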


Solving Ajar queries with product aggregates 

In our proofs and discussions for the remainder of this section, we will treat the set $TOP_P$ as a single element for convenience, implicitly placing an existential quantifier before the statement. For example, when we say $TOP_P(A)$ is an ancestor of $TOP_P(B)$, we mean $\exists t_A \in TOP_P(A), t_B \in TOP_P(B)$ such that $t_A$ is an ancestor of $t_B$. We also often omit the partition $P$ when referring to an AGHD $G = (\mathcal{T}, \chi, P)$; the partition $P$ can be uniquely defined by $(\mathcal{T}, \chi)$, so we will always assume it is defined appropriately.

We can now modify our algorithm from Section 3.3 to detect equivalent orderings using the equivalence test above, then search for compatible AGHDs, and run AggroGHDJoin over the compatible AGHD with the smallest fhw. Our runtime is given by the next theorem. Note that any AGHD of the original hypergraph is also a GHD of some product partition hypergraph.


► Theorem 84 (Copy of Theorem 45). Given an Ajar query $Q_{\mathcal{H},\alpha}$ possibly involving idempotent product aggregates, let $w^*$ be the smallest fhw for an AGHD compatible with an ordering equivalent to $\alpha$. Then the runtime for our algorithm is $\widetilde{O}(\mathrm{IN}^{w^*} + \mathrm{OUT})$.


The theorem is proved in Appendix [A] 


Decomposing AGHDs 

We can apply the decomposition ideas from the main text to Ajar queries with product aggregates as well. In this section we will assume without loss of generality that for any relation $R_F$, the last aggregation in $\alpha_F$ is not a product aggregation. Suppose this assumption is violated, i.e. there exists some relation $R_F$ such that the last aggregation in $\alpha_F$ is the product aggregation $(A_F, \otimes)$. We can then immediately perform this aggregation, transforming the relation to $R_{F \setminus \{A_F\}}$ and removing the product aggregation. This assumption ensures that every relation appears in one of the subtrees in the decomposition defined below. We now define some terms.

Given an Ajar query $Q_{\mathcal{H},\alpha}$, suppose we have a subset of the nodes $U \subseteq V$. Define $\mathcal{E}_U$ to be $\{E \in \mathcal{E} \mid E \cap U \neq \emptyset\}$, i.e. the set of edges that intersect with $U$. Additionally, define $\alpha_{-[i]}$ to be $\alpha$ with the first $i$ elements removed. We will be looking at the connected components of $\mathcal{H} \setminus (V_{-\alpha} \cup PA(\alpha))$. For any connected component $C$, let $C^+ = C \cup \{v \in PA(\alpha) \mid \exists E \in \mathcal{E}_C : v \in E\}$. Additionally, given an ordering $\alpha$, we define $\alpha^O$ based on a conditional: if $\alpha_1$ is a product aggregation, let $\alpha^O$ be just $\alpha_1$; if $\alpha_1$ is not a product aggregation, let $\alpha^O$ be the set of attributes that can be commuted to the beginning of the ordering. To be more precise for this second case, given an attribute $A$ that appears in $\alpha_j$ with operator $\oplus$, $A \in \alpha^O$ if for all $\alpha_i = (B, \oplus')$ such that $i < j$ either $\oplus' = \oplus$ or $A$ and $B$ are not connected among the nodes $(\alpha_{-[i-1]} \setminus PA(\alpha_{-[i-1]})) \cup \{A, B\}$.
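
The components $C$ and their extensions $C^+$ used below can be computed with a plain graph search. The following sketch assumes edges are given as frozensets of attribute names and that the caller supplies the nodes to delete (output attributes together with product attributes); the function names are hypothetical.

```python
def connected_components(edges, removed):
    """Components of the hypergraph after deleting the nodes in `removed`."""
    nodes = set().union(*edges) - set(removed)
    adjacency = {v: set() for v in nodes}
    for edge in edges:
        kept = [v for v in edge if v in nodes]
        for v in kept:
            adjacency[v].update(kept)
    components, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components


def edges_touching(edges, nodes):
    """The edges that intersect the given node set."""
    return [e for e in edges if e & nodes]


def extend_with_product_attrs(component, edges, product_attrs):
    """Add the product attributes that share an edge with the component."""
    touched = set().union(*edges_touching(edges, component))
    return component | (touched & set(product_attrs))


# Hypothetical query: output attribute "a", product attribute "p".
example_hyperedges = [
    frozenset("ab"), frozenset("bp"), frozenset("pc"), frozenset("cd"),
]
example_components = connected_components(example_hyperedges, removed={"a", "p"})
```

Here removing `a` and `p` splits the remaining attributes into `{b}` and `{c, d}`, and both components pick up `p` when extended.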

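► Definition 85. Given an Ajar query $Q_{\mathcal{H},\alpha}$, we say an AGHD $(\mathcal{T}, \chi, P)$ is decomposable if:

- There exists a rooted subtree $\mathcal{T}_0$ of $\mathcal{T}$ such that $\chi(\mathcal{T}_0) = V(-\alpha)$ (i.e. output attributes).
- For each connected component $C$ of $\mathcal{H} \setminus (V_{-\alpha} \cup PA(\alpha))$, there is exactly one subtree $\mathcal{T}_C \in \mathcal{T} \setminus \mathcal{T}_0$ such that $\mathcal{T}_C$ is a decomposable AGHD of $Q_{(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C), \alpha_{C^+} \setminus \alpha^O_{C^+}}$.

Then we have theorems analogous to the corresponding theorems for queries without product aggregates.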


► Theorem 86 (Copy of Theorem 46). All decomposable AGHDs are compatible with an ordering $\beta$ such that $\beta \equiv_{\mathcal{H}} \alpha$.


Proof. Suppose we are given an Ajar query $Q_{\mathcal{H},\alpha}$ and a decomposable AGHD $G$ for this query. We show a stronger statement: all decomposable AGHDs are compatible with an ordering $\beta$ such that $\beta \equiv_{\mathcal{H}} \alpha$ and $PA(\beta) = PA(\alpha)$ (i.e. the order of the product attributes does not change). Proof by induction on $|\alpha|$. When $|\alpha| = 0$, all GHDs are decomposable and all GHDs are compatible with $\alpha$.

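Suppose $|\alpha| > 0$. By definition, there is a subtree $\mathcal{T}_0$ of $G$ such that $\chi(\mathcal{T}_0) = V(-\alpha)$. And for each connected component $C$ of $\mathcal{H} \setminus (V(-\alpha) \cup PA(\alpha))$, we have a subtree $\mathcal{T}_C$ that is a decomposable GHD for the query $Q_{(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C), \alpha_{C^+} \setminus \alpha^O_{C^+}}$. We use $V_C$ to denote $\bigcup_{E \in \mathcal{E}_C} E$ and $\mathcal{H}_C$ to denote $(V_C, \mathcal{E}_C)$. Similarly, we will use $\alpha^C$ to represent $\alpha_{C^+} \setminus \alpha^O_{C^+}$. By the inductive hypothesis, each of these subtrees $\mathcal{T}_C$ is compatible with some ordering $\beta^C$ such that $\beta^C \equiv_{\mathcal{H}_C} \alpha^C$ and $PA(\beta^C) = PA(\alpha^C)$. Note that $\beta^C \equiv_{\mathcal{H}_C} \alpha^C$ trivially implies $\beta^C \equiv_{\mathcal{H}} \alpha^C$.

For each $C$ we will construct a $\beta^{C^+}$ such that $\mathcal{T}_C$ is compatible with $\beta^{C^+}$, $\beta^{C^+} \equiv_{\mathcal{H}} \alpha_{C^+}$, and $PA(\beta^{C^+}) = PA(\alpha_{C^+})$. Since $\alpha^C = \alpha_{C^+} \setminus \alpha^O_{C^+}$, this requires adding the elements of $\alpha^O_{C^+}$ to $\beta^C$. Define $\beta^{C_O}$ to be some ordering of the elements of $\alpha^O_{C^+}$ compatible with $G$ (i.e. for any $A, B \in V(\alpha^O_{C^+})$, if $TOP_{\mathcal{T}_C}(A)$ is an ancestor of $TOP_{\mathcal{T}_C}(B)$, $A$ precedes $B$ in $\beta^{C_O}$). We claim the ordering $\beta^{C^+} = \beta^{C_O} \circ \beta^C$ satisfies our three conditions.

The first condition is that $\mathcal{T}_C$ is compatible with this $\beta^{C^+}$. This is trivially true because we constructed the ordering by adding output attributes to the start of $\beta^C$, with which $\mathcal{T}_C$ is already compatible, in an order that is guaranteed to be compatible.

The second condition is that $\beta^{C^+} \equiv_{\mathcal{H}} \alpha_{C^+}$. By the definition of $\alpha^O_{C^+}$, $\alpha_{C^+} \equiv_{\mathcal{H}} \alpha^O_{C^+} \circ \alpha^C$. We know $\beta^C \equiv_{\mathcal{H}} \alpha^C$ by the inductive hypothesis. And we claim $\beta^{C_O} \equiv_{\mathcal{H}} \alpha^O_{C^+}$, which implies $\beta^{C^+} \equiv_{\mathcal{H}} \alpha_{C^+}$ by definition. We show this claim by showing that the operators of $\alpha^O_{C^+}$ are uniform, implying that its elements can be reordered freely. In particular, consider the first element $(A_1, \oplus_1)$ of $\alpha_{C^+}$. Since $C^+$ is a connected component, there must exist a path between $A_1$ and every other node among the nodes $C^+$. Thus, for any $(B, \oplus') \in \alpha_{C^+}$ such that $\oplus' \neq \oplus_1$, $A_1$ will violate the path condition for commuting and ensure $B \notin V(\alpha^O_{C^+})$.

The third condition is that $PA(\beta^{C^+}) = PA(\alpha_{C^+})$. By the inductive hypothesis, $PA(\beta^C) = PA(\alpha^C)$. We simply need to show $PA(\beta^{C_O}) = PA(\alpha^O_{C^+})$. There are two cases to consider, from the definition of $\alpha^O_{C^+}$. In the first case, both $\beta^{C_O}$ and $\alpha^O_{C^+}$ contain only one (product) aggregation. In the second case, the two orderings have no product aggregations. In either case, $PA(\beta^{C_O}) = PA(\alpha^O_{C^+})$ trivially.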

We now need to combine the $\beta^{C^+}$ for each $C$ to construct the desired ordering $\beta$. We construct $\beta$ by repeating the following two steps until every $\beta^{C^+}$ is empty: (1) remove the non-product output prefixes of each $\beta^{C^+}$ and append them to $\beta$ (interleaved arbitrarily) and (2) remove the earliest remaining product aggregation of $PA(\alpha)$ from the start of the appropriate $\beta^{C^+}$ and append it to $\beta$. Note that this procedure ensures $\beta_{C^+} = \beta^{C^+}$ for each $C$, which implies $\beta_{C^+} \equiv_{\mathcal{H}} \alpha_{C^+}$ and (by the soundness of the equivalence-testing algorithm) $\beta \equiv_{\mathcal{H}} \alpha$. Also note that the procedure preserves the ordering of the product aggregates, so $PA(\beta) = PA(\alpha)$. Finally, the given AGHD $G$ must be compatible with $\beta$. The construction of $G$ ensures the top nodes of output attributes are all above the top nodes of non-output attributes, and the top nodes of non-output attributes are in the subtrees $\mathcal{T}_C$, which means the fact that $\beta_{C^+} = \beta^{C^+}$ ensures these top nodes are ordered in a compatible manner. ◄

► Theorem 87 (Copy of Theorem 47). For every valid AGHD $(\mathcal{T}, \chi)$, there exists a decomposable $(\mathcal{T}', \chi')$ such that for all node-monotone functions $\gamma$, the $\gamma$-width of $(\mathcal{T}', \chi')$ is no larger than the $\gamma$-width of $(\mathcal{T}, \chi)$.


Proof. We first modify the definition of subtree-connected from Appendix [P]:

- subtree-connected: for any node $t \in \mathcal{T}$ and the subtree $\mathcal{T}_t$ rooted at $t$, consider the set of attributes $V_t = \{v \in V \mid TOP_{\mathcal{T}}(v) \in \mathcal{T}_t\}$; we require that for any two attributes $A, B \in V_t$, there exists a path from $A$ to $B$ in the set $(V_t \setminus PA(\alpha)) \cup \{A, B\}$.

The same transformation used for the original definition can be used for this adjusted definition. Note that this transformation ensures that any node that is $TOP_{\mathcal{T}}$ for a product aggregation has only one child. Also note that the described transformation might change the partition function $P$ of the AGHD, but it does not change the compatible order.

Suppose the given Ajar problem is $Q_{\mathcal{H},\alpha}$. Since $(\mathcal{T}, \chi)$ is valid, there must exist an ordering $\beta$ such that $(\mathcal{T}, \chi)$ is compatible with $\beta$ and $\alpha \equiv_{\mathcal{H}} \beta$. The width-preserving transformations of the appendix preserve the compatibility with an ordering. So we can apply them to get a TOP-unique and subtree-connected AGHD $(\mathcal{T}', \chi')$ that is compatible with $\beta$ and has $\gamma$-width no larger than that of $(\mathcal{T}, \chi)$. We claim that this AGHD is decomposable.

As in the appendix, we prove that any valid, TOP-semiunique, and subtree-connected GHD for an Ajar query is decomposable. Proof by induction on $|\alpha|$. If $|\alpha| = 0$, then every GHD is decomposable.

Suppose $|\alpha| > 0$. Consider the set of nodes that are $TOP_{\mathcal{T}}$ nodes for output attributes, i.e. $\{t \in \mathcal{T} \mid \exists A \in V(-\alpha) : TOP_{\mathcal{T}}(A) = t\}$. Since $(\mathcal{T}', \chi')$ is compatible with $\beta$, no non-output attribute can have a top node above an output attribute's top node. Thus, the TOP-semiunique property guarantees that this set of nodes forms a rooted subtree $\mathcal{T}_0$ of $\mathcal{T}$ such that $\chi(\mathcal{T}_0) = V(-\alpha)$.

Consider the subtrees in $\mathcal{T} \setminus \mathcal{T}_0$. Call them $\mathcal{T}_1, \mathcal{T}_2, \dots, \mathcal{T}_k$. For any $\mathcal{T}_i$, let $V_i$ be the attributes that have $TOP_{\mathcal{T}}$ nodes in $\mathcal{T}_i$, i.e. $V_i = \{A \in V \mid TOP_{\mathcal{T}}(A) \in \mathcal{T}_i\}$. None of these $V_i$ can contain any output attributes, and the subtree-connected property guarantees that each of the $V_i$ is connected. Thus, the $V_i$ must be the $C^+$ as defined earlier. So for each connected component $C$ of $\mathcal{H} \setminus (V(-\alpha) \cup PA(\alpha))$, the corresponding subtree $\mathcal{T}_C$ is the subtree $\mathcal{T}_i$ such that $V_i = C^+$. Since for any $A \in C$, $TOP_{\mathcal{T}}(A) \in \mathcal{T}_C$, the attributes in $C$ only appear in $\mathcal{T}_C$. Note that for every edge $E \in \mathcal{E}$, there exists a node $t \in \mathcal{T}$ such that $E \subseteq \chi(t)$. This implies that for every edge $E \in \mathcal{E}_C$, there exists a node $t \in \mathcal{T}_C$ such that $E \subseteq \chi(t)$. As such, we can conclude that each $\mathcal{T}_C$ is a GHD for the hypergraph $(\bigcup_{E \in \mathcal{E}_C} E, \mathcal{E}_C)$.

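Define $V_C = \bigcup_{E \in \mathcal{E}_C} E$. To complete this proof, we now need to show that each $\mathcal{T}_C$ is a decomposable GHD for the Ajar query $Q_{(V_C, \mathcal{E}_C), \alpha_{C^+} \setminus \alpha^O_{C^+}}$. By the inductive hypothesis, if $\mathcal{T}_C$ is valid, TOP-semiunique and subtree-connected, it must be decomposable. Note that since $\mathcal{T}$ is TOP-semiunique and subtree-connected, $\mathcal{T}_C$ must also be TOP-semiunique and subtree-connected. We have also established that $\mathcal{T}_C$ is a GHD for $(V_C, \mathcal{E}_C)$. Thus to finish this proof, we only need to show that there exists an ordering $\beta'$ such that $\beta' \equiv_{(V_C, \mathcal{E}_C)} \alpha_{C^+} \setminus \alpha^O_{C^+}$ and $\mathcal{T}_C$ is compatible with $\beta'$.

We know $\mathcal{T}$ is compatible with $\beta$ and $\beta \equiv_{\mathcal{H}} \alpha$. We set $\beta' = \beta_{C^+} \setminus \beta^O_{C^+}$; this implies that $\beta' \equiv_{\mathcal{H}} \alpha_{C^+} \setminus \alpha^O_{C^+}$ since the left-hand and right-hand sides are simply sub-orderings of $\beta$ and $\alpha$, respectively. Furthermore, this implies $\beta' \equiv_{(V_C, \mathcal{E}_C)} \alpha_{C^+} \setminus \alpha^O_{C^+}$, as $(V_C, \mathcal{E}_C)$ is simply $\mathcal{H}$ with some output attributes (of $\beta'$) removed.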

We now need to show that $\mathcal{T}_C$ is compatible with $\beta'$. In other words, we need to show for any two attributes $A, B \in V_C$, if $TOP_{\mathcal{T}_C}(A)$ is an ancestor of $TOP_{\mathcal{T}_C}(B)$, either $A$ is an output attribute or $A$ precedes $B$ in $\beta'$. We show the contrapositive: if $A$ is not an output attribute and $A$ does not precede $B$ in $\beta'$, then $TOP_{\mathcal{T}_C}(A)$ is not an ancestor of $TOP_{\mathcal{T}_C}(B)$. There are a couple of cases to consider. If $B \in V_C \setminus C^+$, $B$ must be in $V(-\alpha)$, implying $TOP_{\mathcal{T}_C}(B)$ is the root of $\mathcal{T}_C$. We note that for attributes in $C^+$, $TOP_{\mathcal{T}_C}$ and $TOP_{\mathcal{T}}$ are equivalent, so we use them interchangeably. If $B \in \beta'$, then we know $B$ must precede $A$ in $\beta'$, which implies $B$ precedes $A$ in $\beta$. The fact that $\mathcal{T}$ is compatible with $\beta$ implies $TOP_{\mathcal{T}}(A)$ is not an ancestor of $TOP_{\mathcal{T}}(B)$. The final case to consider is $B \in \beta^O_{C^+}$.

Even in this case, we have two cases to consider, based on the two definitions of $\beta^O_{C^+}$. If $B$ has a product aggregation, then $B$ must be the first element of $\beta_{C^+}$. This implies $B$ precedes $A$ in $\beta$, guaranteeing that $TOP_{\mathcal{T}}(A)$ is not an ancestor of $TOP_{\mathcal{T}}(B)$. The other case is a bit more involved.

Assume for contradiction that there exist $A, B$ such that $TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(B)$, $B \in \beta^O_{C^+}$, and $A \in \beta'$. We first claim that, without loss of generality, we can suppose that $A$ and $B$ have different operators. To do so, we show that if $A$ and $B$ have the same operator, there must exist an $A' \in \beta'$ with a different operator such that $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(B)$. The fact that $A \notin \beta^O_{C^+}$ implies there is an attribute $A'$ with a different operator such that there exists a path between $A'$ and $A$ composed of attributes that appear after $A'$ in $\beta_{C^+}$. We claim $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(A)$, which implies $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(B)$. Suppose the path between $A'$ and $A$ is $X_0, X_1, X_2, \dots, X_m$ where $A' = X_0$ and $A = X_m$; we will show $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(A)$ by showing $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(X_i)$ for all $i \geq 1$. Proof by induction on $i$. For $i = 1$, $A'$ and $X_1$ share an edge, implying they appear in $\chi(t)$ together for some tree node $t$. By definition, $TOP_{\mathcal{T}}(A')$ and $TOP_{\mathcal{T}}(X_1)$ are both ancestors of $t$. Since $\mathcal{T}$ is TOP-semiunique (so $TOP_{\mathcal{T}}(A') \neq TOP_{\mathcal{T}}(X_1)$) and $\mathcal{T}$ is compatible with $\beta$ (so $TOP_{\mathcal{T}}(X_1)$ cannot be an ancestor of $TOP_{\mathcal{T}}(A')$), this means that $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(X_1)$. For $i > 1$, we know that $X_{i-1}$ and $X_i$ share an edge, implying they appear together in $\chi(t)$ for some tree node $t$. $TOP_{\mathcal{T}}(X_i)$ and $TOP_{\mathcal{T}}(X_{i-1})$ must both be ancestors of $t$. Note that the inductive hypothesis gives us that $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(X_{i-1})$, implying it is an ancestor of $t$. By the same logic as before, this implies that $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(X_i)$. We thus have that $TOP_{\mathcal{T}}(A')$ is an ancestor of $TOP_{\mathcal{T}}(A)$.

We now suppose, without loss of generality, that $A$ and $B$ have different operators. Since $TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(B)$, we know $A$ comes before $B$ in the compatible ordering $\beta$. However, the fact that $B \in \beta^O_{C^+}$ implies that every path between $B$ and $A$ includes an attribute $X$ that is either an output attribute or comes before $B$ in $\beta$. Either way, none of these $X$ is in the subtree rooted at $TOP_{\mathcal{T}}(A)$, implying that $A$ and $B$ are disconnected in the subtree rooted at $TOP_{\mathcal{T}}(A)$. This contradicts the subtree-connected property.

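► Definition 88. Given an Ajar problem $Q_{\mathcal{H},\alpha}$, suppose $C_1, \dots, C_k$ are the connected components of $\mathcal{H} \setminus (V_{-\alpha} \cup PA(\alpha))$. Define a function $H(\mathcal{H}, \alpha)$ that maps Ajar queries to a set of hypergraphs as follows:

- $C_i^{++} = \bigcup_{E \in \mathcal{E}_{C_i}} E$ for all $1 \leq i \leq k$
- $\mathcal{H}_0 = (V_{-\alpha}, \{F \in \mathcal{E} \mid F \subseteq V_{-\alpha}\} \cup \{V_{-\alpha} \cap C_i^{++} \mid 1 \leq i \leq k\})$
- $\mathcal{H}_i^+ = (C_i^{++}, \mathcal{E}_{C_i} \cup \{V_{-\alpha} \cap C_i^{++}\})$
- $H(\mathcal{H}, \alpha) = \{\mathcal{H}_0\} \cup \bigcup_{1 \leq i \leq k} H(\mathcal{H}_i^+, \alpha_{C_i^+} \setminus \alpha^O_{C_i^+})$

The hypergraphs in the set $H(\mathcal{H}, \alpha)$ are defined to be the characteristic hypergraphs.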

► Theorem 89 (Copy of the corresponding main-text theorem). For an Ajar query $Q_{\mathcal{H},\alpha}$ involving product aggregates, suppose $\mathcal{H}_0, \dots, \mathcal{H}_k$ are the characteristic hypergraphs $H(\mathcal{H}, \alpha)$. Then AGHDs $G_0, G_1, \dots, G_k$ of $\mathcal{H}_0, \dots, \mathcal{H}_k$ can be connected to form a decomposable AGHD $G$ for $Q_{\mathcal{H},\alpha}$. Conversely, any decomposable AGHD $G$ of $Q_{\mathcal{H},\alpha}$ can be partitioned into AGHDs $G_0, G_1, \dots, G_k$ of the characteristic hypergraphs $\mathcal{H}_0, \dots, \mathcal{H}_k$. Moreover, in both of these cases, $\gamma$-width$(G) = \max_i \gamma$-width$(G_i)$.

Proof. The proof is exactly the same as the proof of Theorem 31 provided in Appendix [P]. ◄

This lets us apply all the optimizations from Sections 5.2, 5.3, and 5.4 to Ajar queries with product aggregates.

Comparison to FAQ

The runtime of InsideOut on a query involving idempotent product aggregations is given by $\widetilde{O}(\mathrm{IN}^{faqw})$, where the faqw depends on the ordering and on the presence of product aggregations. Our algorithm for handling product aggregations recovers the runtime of FAQ. Formally,

► Theorem 90. For any Ajar query involving idempotent product aggregations, $\mathrm{IN}^{w^*} + \mathrm{OUT} \leq 2 \cdot \mathrm{IN}^{faqw}$.


The proof is in Appendix B.1. By applying ideas from the FAQ paper to our setting, we can also recover the FAQ runtime on #QCQ (Appendix E.3). Our algorithm for detecting when two orderings involving product aggregates are equivalent is both sound and complete; in contrast, FAQ's equivalence testing algorithm is sound but not complete. Moreover, we have a width-preserving decomposition for queries with product aggregates. This allows us to apply all the optimizations mentioned above, giving us tighter runtimes in terms of submodular and DBP-widths (Theorems [M] and 39) and efficient MapReduce algorithms (Theorems 40 and 41). As shown before, FAQ gives a worse runtime exponent in each of these cases.


E.3 Recovering #QCQ 


We discussed idempotent product aggregations and how they can help Ajar generalize QCQ in the main text. There is a variant called #QCQ in which solutions are expected to output the number of solutions to a given QCQ (instead of the solutions themselves). At first this seems like a fairly straightforward extension to QCQ. If we use Ajar to solve a given QCQ, the output is a relation that lists the satisfying assignments, where each tuple's annotation is 1; to count the number of tuples, we simply need to prefix the QCQ query with aggregations using the operator $+$.

An issue arises because these new aggregations need to occur in the domain $\mathbb{Z}^+$ (the non-negative integers) instead of $\{0, 1\}$. Though $(\mathbb{Z}^+, \max, \cdot)$ is still a semiring, the product aggregations are no longer idempotent in the given domain; we discuss how to handle non-idempotent aggregation in Appendix E.4, but the added complexity (and runtime) required to deal with non-idempotent aggregations seems unnecessary in our case. Even though multiplication is not idempotent over the larger domain, we can guarantee that it is idempotent whenever a product aggregation occurs; the annotations do not leave the $\{0, 1\}$ domain until the $+$ aggregations, which must occur after the product aggregations.

To handle this extra structure, we introduce the concept of specifying restricted domains in Ajar queries. To recover #QCQ, we translate the approach of FAQ [15, Section 9.5], which is the minimal application of the restricted domain concept to Ajar queries.


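► Definition 91. Given a domain $\mathbb{K}$ and operator set $O$, we define a restriction to subsets of the domain $\mathbb{K}_r \subseteq \mathbb{K}$ and operator set $O_r \subseteq O$ such that $\{0, 1\} \subseteq \mathbb{K}_r$, $\otimes \in O_r$ and for any $a, b \in \mathbb{K}_r$ and $\oplus \in O_r$, $a \oplus b \in \mathbb{K}_r$.

► Example 92. In the context of #QCQ, $\mathbb{K} = \mathbb{Z}^+$ and $O = \{+, \max, \otimes\}$. The restriction is $\mathbb{K}_r = \{0, 1\}$ and $O_r = \{\max, \otimes\}$.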
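
The closure requirement in Definition 91 is easy to check mechanically for a finite restricted domain. The sketch below uses a hypothetical helper `is_closed` and Python's built-in `max`, `operator.mul`, and `operator.add` as stand-ins for the operators of Example 92.

```python
from itertools import product
import operator


def is_closed(domain, operators):
    """True if every operator maps pairs from `domain` back into `domain`."""
    return all(
        op(a, b) in domain
        for op in operators
        for a, b in product(domain, repeat=2)
    )


restricted_domain = {0, 1}
restricted_ops = [max, operator.mul]   # stand-ins for max and the product operator
unrestricted_ops = [operator.add]      # + leaves {0, 1}, since 1 + 1 = 2

closed_under_restricted = is_closed(restricted_domain, restricted_ops)
closed_under_plus = is_closed(restricted_domain, unrestricted_ops)
```

The restricted operators keep annotations inside `{0, 1}`, while `+` does not, which is why `+` has to stay outside the restricted operator set.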

Note that if we ensure that the specified operators are closed in the restricted domain, the semiring properties will all hold in the restricted domain. We then define an aggregation ordering that incorporates these restrictions: we will define an index $l$ that divides the unrestricted and restricted portions of the ordering.


► Definition 93. Given an attribute set $V$, domain $\mathbb{K}$, operator set $O$, and restriction $\mathbb{K}_r$ and $O_r$, a restriction-compatible aggregation ordering is an aggregation ordering $\alpha$ and index $l$ such that $1 \leq l \leq |\alpha|$ and for each $k > l$, $\alpha_k = (A, \oplus)$ for $A \in V$ and $\oplus \in O_r$.

Any single operator $\oplus$ that appears both before and after the division index $l$ will be treated as different operators (this issue does not come up in the context of #QCQ). We can then define an Ajar query to use a restriction-compatible ordering, and any instance of the query must have $\mathbb{K}_r$-relations. Under this definition, we can treat the product aggregations as idempotent, allowing us to use our earlier work on idempotent product aggregations to recover #QCQ.

This set-up is essentially a translation of FAQ's results to our language and notation. Using the exact same construction described in the previous Appendix section, we can now recover FAQ's runtime on #QCQ as well. We note that we could extend this idea of restricting domains even further by relying on our GHDs. In particular, we can have every single element of the aggregation ordering specify its own domain, and a valid GHD would have to ensure that for any $A, B$ such that $TOP_{\mathcal{T}}(A)$ is an ancestor of $TOP_{\mathcal{T}}(B)$, the semiring domain corresponding to $A$ is a superset of the semiring domain corresponding to $B$.

E.4 Non-Idempotent Product Aggregations

Our AggroYannakakis algorithm actually implicitly assumes that any product aggregation that 
arises consists of an idempotent operator. 

► Definition 94. Given a set $S$, an operator $\otimes$ is idempotent if and only if for any element $a \in S$, $a \otimes a = a$.

This is a reasonable assumption, as the problems that we've discovered using product aggregation all tend to have idempotent products. The key difference between an idempotent and non-idempotent operator is the distributive property; $(a \otimes b) \otimes (a \otimes c) = a \otimes (b \otimes c)$ only if $\otimes$ is idempotent. Note that the non-idempotent case would require an $a^2$. So, to be complete, we can support non-idempotent operators by raising the annotations of every other relation to a power. In particular, if we have a non-idempotent aggregator for an attribute $A$, we should raise the annotations for the relations in every other node in our tree to the $|\mathcal{D}^A|$ power when we aggregate the attribute $A$ away.
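
The power correction can be seen on a tiny numeric example. The sketch below is illustrative only: it uses ordinary integer multiplication as the non-idempotent product, a hypothetical relation `r` over attribute $A$, and a single annotation `s` from a relation that does not contain $A$.

```python
import math


def is_idempotent(op, domain):
    """True if op(a, a) == a for every element of the (finite) domain."""
    return all(op(a, a) == a for a in domain)


def naive_product(r, s):
    """Product aggregation over A of r(A) * s, computed without pushing down."""
    return math.prod(r[a] * s for a in r)


def pushed_down_product(r, s):
    """Aggregate r on its own, then raise s to the |D^A| power to compensate."""
    return math.prod(r.values()) * s ** len(r)


r = {0: 2, 1: 3}   # annotations of a relation over A, with |D^A| = 2
s = 5              # annotation from a relation that does not mention A

# Without the power, the pushed-down value would be 2 * 3 * 5 = 30, not 150.
uncorrected = math.prod(r.values()) * s
```

Integer multiplication is idempotent on `{0, 1}` but not on larger domains, so the correction is only needed once annotations can leave `{0, 1}`.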



Extension: Computing Transitive Closure 


A standard extension to the basic relational algebra is the transitive closure or Kleene star operator. In this section, we explore how our framework for solving Ajar queries can be applied to computing transitive closures. First we define the operator using the language of Ajar. Given a relation $R$ with two attributes, consider the query

$$Q_k = \sum_{A_2} \sum_{A_3} \cdots \sum_{A_k} \bowtie_{1 \leq i \leq k} R(A_i, A_{i+1})$$

where each of the $R(A_i, A_{i+1})$ are identical copies of $R$ with the attributes named as specified. Note that our output $Q_k$ is going to be a two-attribute relation. Suppose there exists some $k^*$ such that $Q_k$ is identical for all $k \geq k^*$. We can then define the transitive closure of a relation $R$, denoted $R^*$, to be $Q_{k^*}$.

This classic operator has natural applications in the context of graphs. If our relation $R$ is a list of (directed) edges (without meaningful annotations), computing $R^*$ is equivalent to computing the connected components of our graph. If we add annotations over the semiring $(\mathbb{Z} \cup \{\infty\}, \min, +)$ where each edge is annotated with a weight, then computing $R^*$ is equivalent to computing all pairs shortest paths. Note that we can guarantee $R^*$ exists as long as (i) our graph contains no negative weight cycles and (ii) our relation contains self-edges with weight 0. We will discuss computing $R^*$ in the context of graphs, applying it to the all pairs shortest path problem. Let $E$ be the number of edges and $V$ the number of nodes in the graph; we will derive the complexity of computing all pairs shortest paths in terms of $E$ and $V$.

A naive algorithm for finding $R^*$ is to compute $Q_1, Q_2, Q_4, \dots$ until we find two consecutive results that are identical. This approach requires answering $O(\log k^*)$ Ajar queries. In the context of all pairs shortest path, we know $k^* \leq V$, which means that the number of queries to answer is $O(\log V)$. We start by analyzing the computation required to answer a query of the form $Q_{2^n}$.
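
The doubling loop can be sketched over the $(\min, +)$ semiring with a dense matrix. Note the swapped-in technique: this sketch obtains each $Q_{2k}$ by squaring the previous result $Q_k$, rather than evaluating every $Q_{2^n}$ from scratch through a GHD as analyzed in this section. The function names are hypothetical, and the input is assumed to contain weight-0 self-edges and no negative cycles.

```python
import math


def min_plus_square(dist):
    """One doubling step: combine every pair of paths through a midpoint."""
    n = len(dist)
    return [
        [min(dist[i][m] + dist[m][j] for m in range(n)) for j in range(n)]
        for i in range(n)
    ]


def shortest_path_closure(weights):
    """Repeat the doubling step until two consecutive results are identical."""
    current = weights
    while True:
        doubled = min_plus_square(current)
        if doubled == current:
            return current
        current = doubled


INF = math.inf
# Hypothetical 3-node graph: 0 -> 1 (weight 1), 1 -> 2 (weight 2), plus self-edges.
example_weights = [
    [0, 1, INF],
    [INF, 0, 2],
    [INF, INF, 0],
]
example_closure = shortest_path_closure(example_weights)
```

On this graph the loop stops after the second squaring, once the path from node 0 to node 2 has been found.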

We define the GHD to use for $Q_{2^n}$ recursively. Our base case, when $n = 1$, is to have a single bag containing all three attributes $A_1, A_2, A_3$. For $n > 1$, the root of our GHD will contain the attributes $A_1, A_{2^{n-1}+1}, A_{2^n+1}$. It will have two children: on the left, it will have the GHD corresponding to $Q_{2^{n-1}}$ and on the right it will have an identical GHD over the attributes $A_{2^{n-1}+1}, A_{2^{n-1}+2}, \dots, A_{2^n+1}$ instead of the attributes $A_1, A_2, \dots, A_{2^{n-1}+1}$. Note that each bag of our constructed GHD has 3 attributes, but they may not appear in any relation together. Additionally, note that the depth of our GHD is simply $n$.
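
The recursive construction can be written down directly. In the sketch below, attributes $A_1, \dots, A_{2^n+1}$ are represented by their integer indices and each tree node is a plain dict; these representation choices and the function names are assumptions made for illustration.

```python
def build_ghd(lo, hi):
    """Build the GHD over attributes A_lo, ..., A_hi (hi - lo must be a power of two >= 2)."""
    if hi - lo == 2:
        # Base case: a single bag holding three consecutive attributes.
        return {"bag": (lo, lo + 1, hi), "children": []}
    mid = (lo + hi) // 2
    return {
        "bag": (lo, mid, hi),
        "children": [build_ghd(lo, mid), build_ghd(mid, hi)],
    }


def ghd_for_power_of_two(n):
    """GHD for Q_{2^n}, whose attributes are A_1, ..., A_{2^n + 1} (n >= 1)."""
    return build_ghd(1, 2 ** n + 1)


def depth(node):
    """Number of levels in the tree, counting a single bag as depth 1."""
    return 1 + max((depth(child) for child in node["children"]), default=0)


example_ghd = ghd_for_power_of_two(2)
```

For $n = 2$ the root bag is $(A_1, A_3, A_5)$ with children $(A_1, A_2, A_3)$ and $(A_3, A_4, A_5)$, and every subtree at a given level has the same shape.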

If we naively apply the AGM bound to derive the fractional hypertree width, we get a width of $E^3$. However, if, for each attribute $A_i$, we (virtually) create a relation $S(A_i)$ of size $V$, our fractional hypertree width becomes $V^3$. Alternatively, we can also use DBP-width to derive the bound without introducing these relations.

Applying the results of GYM gives us that we can answer $Q_{2^n}$ in $O(n)$ MapReduce rounds with $O(V^3)$ communication cost. Given that we need to answer $O(\log V)$ of these queries and that $n \leq O(\log V)$ for each of these queries, we have an $O(\log^2 V)$ round MapReduce algorithm with $\widetilde{O}(V^3)$ total communication cost for all pairs shortest paths, which is within poly-log factors of standard algorithms for this problem.

In addition, if we allow an $O(k^* \log k^*)$ round MapReduce algorithm, we can reduce the total communication cost to $O(EV)$ by using a chain GHD. In particular, for a query $Q_k$, the GHD will be a chain of $k$ bags such that the $i$-th bag in our chain consists of $A_i$, $A_{i+1}$ and $A_{k+1}$. This construction ensures that two of the three attributes in each bag appear in a relation together, reducing the width to $EV$.

We note that we derived this MapReduce bound with our generic algorithms, without any specialization for this particular problem. We can also derive a serial algorithm for the problem with the same bound, but it requires a small optimization. By construction, our (original, non-chain) GHD has the property that every subtree whose root is at a particular level is completely identical. This means that AggroGHDJoin does not need to visit each bag; it simply needs to visit one bag per level, and then assign the result to the other bags on the level. With this optimization, our algorithm computes all pairs shortest paths in $\widetilde{O}(V^3)$, again within poly-log factors of specialized graph algorithms.


